Re: A few questiosn about encoding
On Sunday, 23 June 2013 18:30:40 UTC+2, Steven D'Aprano wrote:
> On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:
>
>> utf-8: how many bytes to hold an 'a' in memory? one byte.
>>
>> flexible string representation: how many bytes to hold an 'a' in
>> memory? One byte? No, two. (Funny, it consumes more memory to hold
>> an ascii char than ascii itself)
>
> Incorrect. Python strings have overhead because they are objects, so
> let's see the difference adding a single character makes:
>
> # Python 3.3, with the hated flexible string representation:
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 1
>
> # Python 3.2:
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 4
>
> How about a French é character? Of course, ASCII cannot store it *at
> all*, but let's see what Python can do:
>
> # The hated Python 3.3 again:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 1
>
> # And Python 3.2:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 4
>
>> utf-8: In a series of bytes implementing the encoded code points
>> supposed to hold a string, picking a byte and finding to which
>> encoded code point it belongs is no problem.
>
> Incorrect. UTF-8 is unsuitable for random access, since it has
> variable-width characters, anything from 1 to 4 bytes. So you cannot
> just jump directly to character 1000 in a block of text, you have to
> inspect each byte one-by-one to decide whether it is a 1, 2, 3 or 4
> byte character.
>
>> flexible string representation: In a series of bytes implementing
>> the encoded code points supposed to hold a string, picking a byte
>> and finding to which encoded code point it belongs is ...
>> impossible !
>
> Incorrect. It is absolutely trivial. Each string is marked as either
> 1-byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is
> one character. If it is a 2-byte string, then it is just like Python
> 3.2 narrow build, and each two bytes is a character. If it is a 4-byte
> string, then it is just like Python 3.2 wide build, and each four
> bytes is a character.
>
> Within a single string, the number of bytes per character is fixed,
> and random access is easy and fast.
>
> --
> Steven

:-)

--
http://mail.python.org/mailman/listinfo/python-list
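The random-access point quoted above can be sketched roughly as follows, assuming Python 3.3 or later. The helper name `utf8_index` is hypothetical and not from the thread; it only illustrates that finding the n-th code point in UTF-8 bytes means scanning from the start, whereas a `str` can be indexed directly.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 encoded data by scanning.

    Hypothetical helper for illustration: continuation bytes have the
    form 0b10xxxxxx, so every other byte starts a new code point.
    """
    if n < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # ASCII or lead byte: a new code point starts
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "aé€\U0001F600z"              # 1, 2, 3, 4 and 1 bytes in UTF-8
encoded = text.encode("utf-8")

# Scanning the bytes and indexing the str agree at every position,
# but only the scan has to walk over every preceding byte.
scanned = [utf8_index(encoded, i) for i in range(len(text))]
```

Indexing `text[3]` needs no scan because every code point in a given `str` occupies the same number of bytes.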
Re: A few questiosn about encoding
On Thursday, 20 June 2013 19:17:12 UTC+2, MRAB wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
>> On Fri, Jun 21, 2013 at 2:27 AM, wxjmfa...@gmail.com wrote:
>>> And all these coding schemes have something in common: they all
>>> work with a unique set of code points, more precisely a unique set
>>> of encoded code points (not the set of implemented code points
>>> (byte)).
>>>
>>> Just what the flexible string representation is not doing: it
>>> artificially divides Unicode into subsets and tries to handle each
>>> subset differently.
>>
>> UTF-16 divides Unicode into two subsets: BMP characters (encoded
>> using one 16-bit unit) and astral characters (encoded using two
>> 16-bit units in the D800::/5 netblock, or equivalent thereof). Your
>> beloved narrow builds are guilty of exactly the same crime as the
>> hated 3.3.
>
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!

Sorry, but no, it does not work that way: this confuses the set of encoded code points with their implementation, the so-called code units.

utf-8: how many bytes to hold an 'a' in memory? one byte.

flexible string representation: how many bytes to hold an 'a' in memory? One byte? No, two. (Funny, it consumes more memory to hold an ascii char than ascii itself)

utf-8: In a series of bytes implementing the encoded code points supposed to hold a string, picking a byte and finding to which encoded code point it belongs is no problem.

flexible string representation: In a series of bytes implementing the encoded code points supposed to hold a string, picking a byte and finding to which encoded code point it belongs is ... impossible ! This is one of the causes of the flexible string representation working badly.

The basics of any coding scheme, Unicode included.

jmf
Re: A few questiosn about encoding
On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:

> utf-8: how many bytes to hold an 'a' in memory? one byte.
>
> flexible string representation: how many bytes to hold an 'a' in
> memory? One byte? No, two. (Funny, it consumes more memory to hold an
> ascii char than ascii itself)

Incorrect. Python strings have overhead because they are objects, so let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4

How about a French é character? Of course, ASCII cannot store it *at all*, but let's see what Python can do:

# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1

# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4

> utf-8: In a series of bytes implementing the encoded code points
> supposed to hold a string, picking a byte and finding to which
> encoded code point it belongs is no problem.

Incorrect. UTF-8 is unsuitable for random access, since it has variable-width characters, anything from 1 to 4 bytes. So you cannot just jump directly to character 1000 in a block of text, you have to inspect each byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.

> flexible string representation: In a series of bytes implementing the
> encoded code points supposed to hold a string, picking a byte and
> finding to which encoded code point it belongs is ... impossible !

Incorrect. It is absolutely trivial. Each string is marked as either 1-byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one character. If it is a 2-byte string, then it is just like Python 3.2 narrow build, and each two bytes is a character. If it is a 4-byte string, then it is just like Python 3.2 wide build, and each four bytes is a character.

Within a single string, the number of bytes per character is fixed, and random access is easy and fast.

--
Steven
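For anyone wanting to repeat the measurement above, here is a minimal sketch. It assumes CPython 3.3 or later (the numbers are an implementation detail and may differ elsewhere); the helper name `bytes_per_char` is hypothetical and not from the thread.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Growth in object size when one more copy of ch is appended.

    Subtracting two sizes cancels the fixed per-object overhead.
    """
    return sys.getsizeof(ch * 100) - sys.getsizeof(ch * 99)


# One sample per storage width: ASCII, Latin-1, the rest of the BMP,
# and an astral (non-BMP) character.
widths = {ch: bytes_per_char(ch) for ch in ("a", "é", "€", "\U0001F600")}

for ch, width in widths.items():
    print("U+{:04X}: {} byte(s) per character".format(ord(ch), width))
```

The width is chosen per string from its widest character, which is why it is fixed within any single string.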
Re: A few questiosn about encoding
On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>> Gah! That's twice I've screwed that up. Sorry about that!
>
> Yeah, and your difficulty explaining the Unicode implementation
> reminds me of a passage from the Python zen:
>
> "If the implementation is hard to explain, it's a bad idea."

The *implementation* is easy to explain. It's the names of the encodings which I get tangled up in.

ASCII: Supports exactly 127 code points, each of which takes up exactly 7 bits. Each code point represents a character.

Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and about a gazillion other legacy charsets, all of which are mutually incompatible: supports anything from 127 to 65535 different code points, usually under 256.

UCS-2: Supports exactly 65535 code points, each of which takes up exactly two bytes. That's fewer than required, so it is obsoleted by:

UTF-16: Supports all 1114111 code points in the Unicode charset, using a variable-width system where the most popular characters use exactly two bytes and the remaining ones use a pair of characters.

UCS-4: Supports exactly 4294967295 code points, each of which takes up exactly four bytes. That is more than needed for the Unicode charset, so this is obsoleted by:

UTF-32: Supports all 1114111 code points, using exactly four bytes each. Code points outside of the range 0 through 1114111 inclusive are an error.

UTF-8: Supports all 1114111 code points, using a variable-width system where popular ASCII characters require 1 byte, and others use 2, 3 or 4 bytes as needed.

Ignoring the legacy charsets, only UTF-16 is a terribly complicated implementation, due to the surrogate pairs. But even that is not too bad. The real complication comes from the interactions between systems which use different encodings, and that's nothing to do with Unicode.

--
Steven
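As a rough illustration of the widths described above, the following sketch encodes four sample characters with Python's standard codecs and counts the bytes. The sample characters are my own choice, not from the thread.

```python
samples = ["a", "é", "€", "\U0001F600"]   # U+0061, U+00E9, U+20AC, U+1F600

# The "-le" codec variants are used so that no byte order mark is added.
encodings = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {
    ch: tuple(len(ch.encode(enc)) for enc in encodings)
    for ch in samples
}

for ch, (u8, u16, u32) in sizes.items():
    print("U+{:04X}: utf-8={} utf-16={} utf-32={}".format(ord(ch), u8, u16, u32))
```

Only the last sample lies outside the BMP, so it is the only one that needs a surrogate pair (four bytes) in UTF-16.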
Re: A few questiosn about encoding
On 20/06/2013 07:26, Steven D'Aprano wrote:
> On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
>> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>>> Gah! That's twice I've screwed that up. Sorry about that!
>>
>> Yeah, and your difficulty explaining the Unicode implementation
>> reminds me of a passage from the Python zen:
>>
>> "If the implementation is hard to explain, it's a bad idea."
>
> The *implementation* is easy to explain. It's the names of the
> encodings which I get tangled up in.

You're off by one below!

> ASCII: Supports exactly 127 code points, each of which takes up
> exactly 7 bits. Each code point represents a character.

128 codepoints.

> Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251,
> and about a gazillion other legacy charsets, all of which are mutually
> incompatible: supports anything from 127 to 65535 different code
> points, usually under 256.

128 to 65536 codepoints.

> UCS-2: Supports exactly 65535 code points, each of which takes up
> exactly two bytes. That's fewer than required, so it is obsoleted by:

65536 codepoints.

etc.

> UTF-16: Supports all 1114111 code points in the Unicode charset, using
> a variable-width system where the most popular characters use exactly
> two bytes and the remaining ones use a pair of characters.
>
> UCS-4: Supports exactly 4294967295 code points, each of which takes up
> exactly four bytes. That is more than needed for the Unicode charset,
> so this is obsoleted by:
>
> UTF-32: Supports all 1114111 code points, using exactly four bytes
> each. Code points outside of the range 0 through 1114111 inclusive are
> an error.
>
> UTF-8: Supports all 1114111 code points, using a variable-width system
> where popular ASCII characters require 1 byte, and others use 2, 3 or
> 4 bytes as needed.
>
> Ignoring the legacy charsets, only UTF-16 is a terribly complicated
> implementation, due to the surrogate pairs. But even that is not too
> bad. The real complication comes from the interactions between systems
> which use different encodings, and that's nothing to do with Unicode.
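The off-by-one corrections above come down to the difference between the highest code point and the number of code points, since numbering starts at zero. A small sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF); the variable names are illustrative only:

```python
import sys

ascii_count = 2 ** 7            # 128 code points, numbered 0 through 127
ucs2_count = 2 ** 16            # 65536 code points, numbered 0 through 65535
unicode_count = 0x10FFFF + 1    # 1114112 code points, numbered 0 through 1114111

# sys.maxunicode is the highest code point, not the count.
highest = sys.maxunicode

print(ascii_count, ucs2_count, unicode_count, highest)
```

So 127, 65535 and 1114111 are the last valid values in each range, while the counts are one larger.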
Re: A few questiosn about encoding
On Thursday, June 20, 2013 1:26:17 AM UTC-5, Steven D'Aprano wrote:
> The *implementation* is easy to explain. It's the names of the
> encodings which I get tangled up in.

Well, ignoring the fact that your last explanation is still buggy, you have not actually described an implementation, no, you've merely generalized (and quite vaguely, I might add) the technical specification of a few encodings. Let's ask Wikipedia to enlighten us on the subject of implementation:

# Define: Implementation
#
# In computer science, an implementation is a realization
# of a technical specification or algorithm as a program,
# software component, or other computer system through
# computer programming and deployment. Many
# implementations may exist for a given specification or
# standard. For example, web browsers contain
# implementations of World Wide Web Consortium-recommended
# specifications, and software development tools contain
# implementations of programming languages.

Do you think someone could reliably implement the alphabet of a new language in Unicode by using the general outline you provided? -- again, ignoring your continual fumbling when explaining that simple generalization :-)

Your generalization is analogous to explaining web browsers as: "software that allows a user to view web pages in the range www.*" Do you think someone could implement a web browser from such limited specification? (if that was all they knew?).

Since we're on the subject of Unicode:

One of the most humorous aspects of Unicode is that it has encodings for Braille characters. Hmm, this presents a conundrum of sorts. RIDDLE ME THIS?!

Since Braille is a type of reading for the blind by utilizing the sense of touch (therefore DEMANDING 3 dimensions) and glyphs derived from Unicode are restrictively two dimensional, because let's face it people, Unicode exists in your computer, and computer screens are two dimensional... but you already knew that -- I think?, then what is the purpose of a Unicode Braille character set?

That should haunt your nightmares for some time.
Re: A few questiosn about encoding
On 2013.06.20 08:40, Rick Johnson wrote: One the most humorous aspects of Unicode is that it has encodings for Braille characters. Hmm, this presents a conundrum of sorts. RIDDLE ME THIS?! Since Braille is a type of reading for the blind by utilizing the sense of touch (therefore DEMANDING 3 dimensions) and glyphs derived from Unicode are restrictively two dimensional, because let's face it people, Unicode exists in your computer, and computer screens are two dimensional... but you already knew that -- i think?, then what is the purpose of a Unicode Braille character set? Two dimensional characters can be made into 3 dimensional shapes. Building numbers are a good example of this. We already have one Unicode troll; do we really need you too? -- CPython 3.3.2 | Windows NT 6.2.9200 / FreeBSD 9.1 -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote: On 2013.06.20 08:40, Rick Johnson wrote: then what is the purpose of a Unicode Braille character set? Two dimensional characters can be made into 3 dimensional shapes. Yes in the real world. But what about on your computer screen? How do you plan on creating tactile representations of braille glyphs on my monitor? Hey, if you can already do this, please share, as it sure would make internet porn more interesting! Building numbers are a good example of this. Either the matrix is reality or you must live inside your computer as a virtual being. Is your name Tron? Are you a pawn of Master Control? He's such a tyrant! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Fri, Jun 21, 2013 at 1:12 AM, Rick Johnson rantingrickjohn...@gmail.com wrote: On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote: On 2013.06.20 08:40, Rick Johnson wrote: then what is the purpose of a Unicode Braille character set? Two dimensional characters can be made into 3 dimensional shapes. Yes in the real world. But what about on your computer screen? How do you plan on creating tactile representations of braille glyphs on my monitor? Hey, if you can already do this, please share, as it sure would make internet porn more interesting! I had a device for creating embossed text. It predated Unicode by a couple of years at least (not sure how many, because I was fairly young at the time). It was made by a company called Epson, it plugged into the computer via a 25-pin plug, and when it was properly functioning, it had a ribbon of ink that it would bash through to darken the underside of the embossed text. But sometimes that ribbon slipped out of position, and we had beautifully-hammered ASCII text, unsullied by ink. And since the device did graphics too, it could be used for the entire Unicode character set if you wanted. Not sure that it would improve your porn any, but I've no doubt you could try if you wanted. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Thu, Jun 20, 2013 at 11:40 PM, Rick Johnson rantingrickjohn...@gmail.com wrote: Your generalization is analogous to explaining web browsers as: software that allows a user to view web pages in the range www.* Do you think someone could implement a web browser from such limited specification? (if that was all they knew?). Wow. That spec isn't limited, it's downright faulty. Or do you really think that (a) there is such a thing as the range www.*, and that (b) that range has anything to do with web browsers? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Thursday, 20 June 2013 13:43:28 UTC+2, MRAB wrote:
> On 20/06/2013 07:26, Steven D'Aprano wrote:
>> On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:
>>> On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:
>>>> Gah! That's twice I've screwed that up. Sorry about that!
>>>
>>> Yeah, and your difficulty explaining the Unicode implementation
>>> reminds me of a passage from the Python zen:
>>>
>>> "If the implementation is hard to explain, it's a bad idea."
>>
>> The *implementation* is easy to explain. It's the names of the
>> encodings which I get tangled up in.
>
> You're off by one below!
>
>> ASCII: Supports exactly 127 code points, each of which takes up
>> exactly 7 bits. Each code point represents a character.
>
> 128 codepoints.
>
>> Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5,
>> Windows-1251, and about a gazillion other legacy charsets, all of
>> which are mutually incompatible: supports anything from 127 to 65535
>> different code points, usually under 256.
>
> 128 to 65536 codepoints.
>
>> UCS-2: Supports exactly 65535 code points, each of which takes up
>> exactly two bytes. That's fewer than required, so it is obsoleted by:
>
> 65536 codepoints.
>
> etc.
>
>> UTF-16: Supports all 1114111 code points in the Unicode charset,
>> using a variable-width system where the most popular characters use
>> exactly two bytes and the remaining ones use a pair of characters.
>>
>> UCS-4: Supports exactly 4294967295 code points, each of which takes
>> up exactly four bytes. That is more than needed for the Unicode
>> charset, so this is obsoleted by:
>>
>> UTF-32: Supports all 1114111 code points, using exactly four bytes
>> each. Code points outside of the range 0 through 1114111 inclusive
>> are an error.
>>
>> UTF-8: Supports all 1114111 code points, using a variable-width
>> system where popular ASCII characters require 1 byte, and others use
>> 2, 3 or 4 bytes as needed.
>>
>> Ignoring the legacy charsets, only UTF-16 is a terribly complicated
>> implementation, due to the surrogate pairs. But even that is not too
>> bad. The real complication comes from the interactions between
>> systems which use different encodings, and that's nothing to do with
>> Unicode.

And all these coding schemes have something in common: they all work with a unique set of code points, more precisely a unique set of encoded code points (not the set of implemented code points (byte)).

Just what the flexible string representation is not doing: it artificially divides Unicode into subsets and tries to handle each subset differently.

On the other hand, it is because it is impossible to work properly with multiple sets of encoded code points that all these coding schemes exist today. There is simply no other way.

Even exotic schemes like CID-fonts used in pdf are based on that scheme.

jmf
Re: A few questiosn about encoding
On Fri, Jun 21, 2013 at 2:27 AM, wxjmfa...@gmail.com wrote:
> And all these coding schemes have something in common: they all work
> with a unique set of code points, more precisely a unique set of
> encoded code points (not the set of implemented code points (byte)).
>
> Just what the flexible string representation is not doing: it
> artificially divides Unicode into subsets and tries to handle each
> subset differently.

UTF-16 divides Unicode into two subsets: BMP characters (encoded using one 16-bit unit) and astral characters (encoded using two 16-bit units in the D800::/5 netblock, or equivalent thereof). Your beloved narrow builds are guilty of exactly the same crime as the hated 3.3.

ChrisA
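The two-unit encoding of astral characters mentioned above can be sketched as follows. The function name `to_surrogates` is hypothetical; the arithmetic is the standard UTF-16 surrogate calculation, and the "D800::/5" joke refers to all surrogates (U+D800 through U+DFFF) sharing the same top five bits.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split an astral code point (above U+FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


pair = to_surrogates(0x1F600)
encoded = "\U0001F600".encode("utf-16-be")   # big-endian, no byte order mark

print(["{:04X}".format(unit) for unit in pair], encoded.hex())
```

A BMP character such as "A" encodes to a single 16-bit unit, so a UTF-16 string mixes one-unit and two-unit characters.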
Re: A few questiosn about encoding
Rick Johnson rantingrickjohn...@gmail.com wrote: Since we're on the subject of Unicode: One the most humorous aspects of Unicode is that it has encodings for Braille characters. Hmm, this presents a conundrum of sorts. RIDDLE ME THIS?! Since Braille is a type of reading for the blind by utilizing the sense of touch (therefore DEMANDING 3 dimensions) and glyphs derived from Unicode are restrictively two dimensional, because let's face it people, Unicode exists in your computer, and computer screens are two dimensional... but you already knew that -- i think?, then what is the purpose of a Unicode Braille character set? That should haunt your nightmares for some time. From http://www.unicode.org/versions/Unicode6.2.0/ch15.pdf The intent of encoding the 256 Braille patterns in the Unicode Standard is to allow input and output devices to be implemented that can interchange Braille data without having to go through a context-dependent conversion from semantic values to patterns, or vice versa. In this manner, final-form documents can be exchanged and faithfully rendered. http://files.pef-format.org/specifications/pef-2008-1/pef-specification.html#Unicode I wish you a pleasant sleep tonight. Bye, Andreas -- http://mail.python.org/mailman/listinfo/python-list
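To make the "256 Braille patterns" in the quoted passage concrete: the block starts at U+2800, and each of the eight dots corresponds to one bit of the offset from that base. The helper name `braille_cell` below is hypothetical, a sketch rather than anything from the thread or the standard's text.

```python
import unicodedata

BRAILLE_BASE = 0x2800   # U+2800 BRAILLE PATTERN BLANK


def braille_cell(*dots: int) -> str:
    """Return the Braille pattern character with the given dots (1-8) raised."""
    bits = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError("Braille dots are numbered 1 through 8")
        bits |= 1 << (dot - 1)           # dot n is bit n-1 of the offset
    return chr(BRAILLE_BASE + bits)


cell = braille_cell(1, 2)                # dots 1 and 2 raised
pattern_count = 2 ** 8                   # 8 dots, each raised or not

print(cell, unicodedata.name(cell), pattern_count)
```

A Braille display or embosser receiving these code points knows exactly which dots to raise, with no guessing from the letter a pattern might stand for.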
Re: A few questiosn about encoding
On 20/06/2013 17:37, Chris Angelico wrote:
> On Fri, Jun 21, 2013 at 2:27 AM, wxjmfa...@gmail.com wrote:
>> And all these coding schemes have something in common: they all work
>> with a unique set of code points, more precisely a unique set of
>> encoded code points (not the set of implemented code points (byte)).
>>
>> Just what the flexible string representation is not doing: it
>> artificially divides Unicode into subsets and tries to handle each
>> subset differently.
>
> UTF-16 divides Unicode into two subsets: BMP characters (encoded using
> one 16-bit unit) and astral characters (encoded using two 16-bit units
> in the D800::/5 netblock, or equivalent thereof). Your beloved narrow
> builds are guilty of exactly the same crime as the hated 3.3.

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4 bytes, and those who previously used ASCII still need only 1 byte per codepoint!
Re: A few questiosn about encoding
On Fri, Jun 21, 2013 at 3:17 AM, MRAB pyt...@mrabarnett.plus.com wrote:
> On 20/06/2013 17:37, Chris Angelico wrote:
>> On Fri, Jun 21, 2013 at 2:27 AM, wxjmfa...@gmail.com wrote:
>>> And all these coding schemes have something in common: they all
>>> work with a unique set of code points, more precisely a unique set
>>> of encoded code points (not the set of implemented code points
>>> (byte)).
>>>
>>> Just what the flexible string representation is not doing: it
>>> artificially divides Unicode into subsets and tries to handle each
>>> subset differently.
>>
>> UTF-16 divides Unicode into two subsets: BMP characters (encoded
>> using one 16-bit unit) and astral characters (encoded using two
>> 16-bit units in the D800::/5 netblock, or equivalent thereof). Your
>> beloved narrow builds are guilty of exactly the same crime as the
>> hated 3.3.
>
> UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
> bytes, and those who previously used ASCII still need only 1 byte per
> codepoint!

Yes, but there's never (AFAIK) been a Python implementation that represents strings in UTF-8; UTF-16 was one of two options for Python 2.2 through 3.2, and is the one that jmf always seems to be measuring against.

ChrisA
Re: A few questiosn about encoding
Rick Johnson writes: On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote: On 2013.06.20 08:40, Rick Johnson wrote: then what is the purpose of a Unicode Braille character set? Two dimensional characters can be made into 3 dimensional shapes. Yes in the real world. But what about on your computer screen? How do you plan on creating tactile representations of braille glyphs on my monitor? Hey, if you can already do this, please share, as it sure would make internet porn more interesting! Search for braille display on the web. A wikipedia article also led me to braille e-book. (Or search for braille porn, since you are so inclined - the concept turns out to be already out there on the web.) -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 20/06/2013 17:27, wxjmfa...@gmail.com wrote: Le jeudi 20 juin 2013 13:43:28 UTC+2, MRAB a écrit : On 20/06/2013 07:26, Steven D'Aprano wrote: On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote: On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote: Gah! That's twice I've screwed that up. Sorry about that! Yeah, and your difficulty explaining the Unicode implementation reminds me of a passage from the Python zen: If the implementation is hard to explain, it's a bad idea. The *implementation* is easy to explain. It's the names of the encodings which I get tangled up in. You're off by one below! ASCII: Supports exactly 127 code points, each of which takes up exactly 7 bits. Each code point represents a character. 128 codepoints. Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and about a gazillion other legacy charsets, all of which are mutually incompatible: supports anything from 127 to 65535 different code points, usually under 256. 128 to 65536 codepoints. UCS-2: Supports exactly 65535 code points, each of which takes up exactly two bytes. That's fewer than required, so it is obsoleted by: 65536 codepoints. etc. UTF-16: Supports all 1114111 code points in the Unicode charset, using a variable-width system where the most popular characters use exactly two- bytes and the remaining ones use a pair of characters. UCS-4: Supports exactly 4294967295 code points, each of which takes up exactly four bytes. That is more than needed for the Unicode charset, so this is obsoleted by: UTF-32: Supports all 1114111 code points, using exactly four bytes each. Code points outside of the range 0 through 1114111 inclusive are an error. UTF-8: Supports all 1114111 code points, using a variable-width system where popular ASCII characters require 1 byte, and others use 2, 3 or 4 bytes as needed. Ignoring the legacy charsets, only UTF-16 is a terribly complicated implementation, due to the surrogate pairs. But even that is not too bad. 
The real complication comes from the interactions between systems which use different encodings, and that's nothing to do with Unicode. And all these coding schemes have something in common: they all work with a unique set of code points, more precisely a unique set of encoded code points (not the set of implemented code points (byte)). Just what the flexible string representation is not doing: it artificially divides Unicode into subsets and tries to handle each subset differently. On the other hand, it is because it is impossible to work properly with multiple sets of encoded code points that all these coding schemes exist today. There is simply no other way. Even exotic schemes like CID-fonts used in pdf are based on that scheme. jmf I entirely agree with the viewpoints of jmfauth, Nick the Greek, rr, Xah Lee and Ilias Lazaridis on the grounds that disagreeing and stating my beliefs ends up with the Python Mailing List police standing on my back doorstep. Give me the NSA or GCHQ any day of the week :( -- Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green. Snooker commentator 'Whispering' Ted Lowe. Mark Lawrence -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote: Gah! That's twice I've screwed that up. Sorry about that! Yeah, and your difficulty explaining the Unicode implementation reminds me of a passage from the Python zen: If the implementation is hard to explain, it's a bad idea. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 15-06-13 02:28, Cameron Simpson wrote:
> On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
> | So, a numeral = a string representation of a number. Is this correct?
>
> No, a numeral is an individual digit from the string representation
> of a number. So: 65 requires two numerals: '6' and '5'.

Wrong context. A numeral as an individual digit is when you are talking about individual characters in a font. In such a context the set of glyphs that represent a digit are the numerals.

However, in a context of programming, numerals in general refer to the set of strings that represent a number.

--
Antoon.
Re: A few questiosn about encoding
On 17Jun2013 08:49, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:
| On 15-06-13 02:28, Cameron Simpson wrote:
| > On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
| > | So, a numeral = a string representation of a number. Is this correct?
| >
| > No, a numeral is an individual digit from the string representation
| > of a number. So: 65 requires two numerals: '6' and '5'.
|
| Wrong context. A numeral as an individual digit is when you are talking about
| individual characters in a font. In such a context the set of glyphs that
| represent a digit are the numerals.
|
| However in a context of programming, numerals in general refer to the set of
| strings that represent a number.

No, those are just "numbers" or "numeric strings" (if you're being overt about them being strings at all). They're numeric strings because they're composed of numerals. If you think otherwise your vocabulary needs adjusting. A numeral is a single digit.

--
Cameron Simpson c...@zip.com.au

English is a living language, but simple illiteracy is no basis for linguistic evolution. - Dwight MacDonald
Re: A few questiosn about encoding
On 17-06-13 09:08, Cameron Simpson wrote:
> On 17Jun2013 08:49, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:
> | On 15-06-13 02:28, Cameron Simpson wrote:
> | > On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote:
> | > | So, a numeral = a string representation of a number. Is this correct?
> | >
> | > No, a numeral is an individual digit from the string representation
> | > of a number. So: 65 requires two numerals: '6' and '5'.
> |
> | Wrong context. A numeral as an individual digit is when you are talking about
> | individual characters in a font. In such a context the set of glyphs that
> | represent a digit are the numerals.
> |
> | However in a context of programming, numerals in general refer to the set of
> | strings that represent a number.
>
> No, those are just "numbers" or "numeric strings" (if you're being
> overt about them being strings at all). They're numeric strings
> because they're composed of numerals. If you think otherwise your
> vocabulary needs adjusting. A numeral is a single digit.

Just because you are unfamiliar with a context in which numeral means a representation of a number doesn't imply my vocabulary needs adjusting.

--
Antoon Pardon
Re: A few questiosn about encoding
On Sat, Jun 15, 2013 at 10:35 PM, Benjamin Schollnick benja...@schollnick.net wrote:
> Nick,
>
> The only thing that I didn't understand is this line. First, please
> tell me what a byte value is.
>
> \x1b is a sequence you find inside strings (and byte strings, the
> b'...' format).
>
> \x1b is a character (ESC) represented in hex format.
>
> b'\x1b' is a byte object that represents what?
>
>     chr(27).encode('utf-8')
>     b'\x1b'
>     b'\x1b'.decode('utf-8')
>     '\x1b'
>
> After decoding it gives the char ESC in hex format. Shouldn't it
> result in value 27, which is the ordinal of ESC?
>
> I'm sorry, are you not listening?
>
> 1b is a HEXADECIMAL number. As a so-called programmer, did you
> seriously not consider that?
>
> Try this:
>
> 1) Open a Web browser
> 2) Go to Google.com
> 3) Type in "Hex 1B"
> 4) Click on the first link
> 5) In the Hexadecimal column find 1B.
>
> Or open your favorite calculator, and convert Hexadecimal 1B to
> Decimal (Base 10).
>
> - Benjamin

Better: a programmer should know how to convert hexadecimal to decimal.

0x1B = (0x1 * 16^1) + (0xB * 16^0) = (1 * 16) + (11 * 1) = 16 + 11 = 27

It’s that easy, and a programmer should be able to do that in their brain, at least with small numbers. Or at least know that: http://lmgtfy.com/?q=0x1B+in+decimal

Or at least `hex(27)`; or `'{:X}'.format(27)`; or `'%X' % 27`. (I despise hex numbers with lowercase letters, but that’s my personal opinion.)

--
Kwpolska http://kwpolska.tk | GPG KEY: 5EAAEA16
stop html mail | always bottom-post
http://asciiribbon.org | http://caliburn.nl/topposting.html
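The conversions discussed above can be checked directly in Python 3. This is only a sketch tying the pieces together; the variable names are my own. The point is that `'\x1b'` is just how `repr` displays the character whose ordinal is 27.

```python
esc = chr(27)                      # the ESC control character
as_bytes = esc.encode("utf-8")     # a bytes object holding a single byte

value_from_hex = int("1B", 16)     # parse hexadecimal text
value_from_byte = as_bytes[0]      # indexing bytes yields an int in Python 3
value_from_char = ord(esc)         # ordinal of the decoded character
manual = 1 * 16 ** 1 + 11 * 16 ** 0

print(repr(as_bytes), value_from_byte, hex(27), "{:X}".format(27), "%X" % 27)
```

All four values are the same number, 27; `\x1b` and `0x1B` are only different spellings of it.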
Re: A few questiosn about encoding
On Fri, 14 Jun 2013 16:58:20 +0300, Nick the Gr33k wrote: On 14/6/2013 1:14 μμ, Cameron Simpson wrote: Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value. The only thing that i didn't understood is this line. First please tell me what is a byte value Seriously? You don't understand the term byte? And you're the support desk for a webhosting company? -- Denis McMahon, denismfmcma...@gmail.com -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 2013-06-15, Denis McMahon denismfmcma...@gmail.com wrote: On Fri, 14 Jun 2013 16:58:20 +0300, Nick the Gr33k wrote: On 14/6/2013 1:14 PM, Cameron Simpson wrote: Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value. The only thing that i didn't understand is this line. First please tell me what is a byte value Seriously? You don't understand the term byte? And you're the support desk for a webhosting company? Well, we haven't had this thread for a week or so... There is some ambiguity in the term byte. It used to mean the smallest addressable unit of memory (which varied in the past -- at one point, both 20 and 60 bit bytes were common). These days the smallest addressable unit of memory is almost always 8 bits on desktop and embedded processors (but often not on DSPs). That's why when IEEE standards want to refer to an 8-bit chunk of data they use the term octet. :) -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 15/6/2013 5:44 μμ, Grant Edwards wrote: There is some ambiguity in the term byte. It used to mean the smallest addressable unit of memory (which varied in the past -- at one point, both 20 and 60 bit bytes were common). These days the smallest addressable unit of memory is almost always 8 bits on desktop and embedded processors (but often not on DSPs). That's why when IEEE stadards want to refer to an 8-bit chunk of data they use the term octet. What the difference between a byte and a byte's value? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
In article kphul7$74q$1...@reader1.panix.com, Grant Edwards invalid@invalid.invalid wrote: There is some ambiguity in the term byte. It used to mean the smallest addressable unit of memory (which varied in the past -- at one point, both 20 and 60 bit bytes were common). I would have defined it more like, some arbitrary collection of adjacent bits which hold some useful value. Doesn't need to be addressable, nor does it need to be the smallest such thing. For example, on the pdp-10 (36 bit word), it was common to treat a word as either four 9-bit bytes, or five 7-bit bytes (with one bit left over), depending on what you were doing. And, of course, a nybble was something smaller than a byte! And, yes, especially in networking, everybody talks about octets when they want to make sure people understand what they mean. -- http://mail.python.org/mailman/listinfo/python-list
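The two ways of carving up a 36-bit word mentioned above can be sketched in Python. This is only an illustration: split_word is a hypothetical helper, the sample word is made up, and taking the fields from the most significant end (so the spare bit of the 7-bit layout is the lowest one) is an assumption.

```python
def split_word(word, width, count, word_bits=36):
    """Split a word into `count` fields of `width` bits, most significant first."""
    mask = (1 << width) - 1
    fields = []
    for i in range(count):
        shift = word_bits - width * (i + 1)   # position of this field's lowest bit
        fields.append((word >> shift) & mask)
    return fields

word = 0o123456701234                # 12 octal digits = 36 bits (made-up value)

nine_bit = split_word(word, 9, 4)    # four 9-bit bytes, 36 bits used
seven_bit = split_word(word, 7, 5)   # five 7-bit bytes, 35 bits used, 1 left over

print([oct(b) for b in nine_bit])
print(seven_bit, 'left-over bit:', word & 1)
```

Each 9-bit byte is exactly three octal digits of the word, which is one reason octal notation was convenient on such machines.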
Re: A few questiosn about encoding
On 15/6/2013 5:59 μμ, Roy Smith wrote: And, yes, especially in networking, everybody talks about octets when they want to make sure people understand what they mean. 1 byte = 8 bits in networking though since we do not use encoding schemes with variable lengths like utf-8 is, how do we separate when a byte value start and when it stops? do we need a start bit and a stop bit for that? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Sat, 15 Jun 2013 17:49:13 +0300, Nick the Gr33k wrote: What the difference between a byte and a byte's value? Nothing. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Sat, Jun 15, 2013 at 11:14 AM, Nick the Gr33k supp...@superhost.gr wrote: On 15/6/2013 5:59 PM, Roy Smith wrote: And, yes, especially in networking, everybody talks about octets when they want to make sure people understand what they mean. 1 byte = 8 bits in networking though. Since we do not use encoding schemes with variable lengths like utf-8 is, how do we tell where a byte value starts and where it stops? Do we need a start bit and a stop bit for that? And this is specific to python how? -- Joel Goldstick http://joelgoldstick.com -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14/6/2013 4:58 μμ, Nick the Gr33k wrote: On 14/6/2013 1:14 μμ, Cameron Simpson wrote: Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value. The only thing that i didn't understood is this line. First please tell me what is a byte value \x1b is a sequence you find inside strings (and byte strings, the b'...' format). \x1b is a character(ESC) represented in hex format b'\x1b' is a byte object that represents what? chr(27).encode('utf-8') b'\x1b' b'\x1b'.decode('utf-8') '\x1b' After decoding it gives the char ESC in hex format Shouldn't it result in value 27 which is the ordinal of ESC ? No, I mean conceptually, there is no difference between a code-point and its ordinal value. They are the same thing. Why Unicode charset doesn't just contain characters, but instead it contains a mapping of (characters -- ordinals) ? I mean what we do is to encode a character like chr(65).encode('utf-8') What's the reason of existence of its corresponding ordinal value since it doesn't get involved into the encoding process? Thank you very much for taking the time to explain. Can someone please explain these questions too? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
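The relationship between a character, its ordinal, and its encoded bytes can be checked in the interpreter. A small sketch, assuming Python 3; the variable names are illustrative only:

```python
ch = chr(16474)               # a str of length 1, code point U+405A
code_point = ord(ch)          # the int 16474 again
utf8 = ch.encode('utf-8')     # text encoding is defined for str
print(code_point, utf8)

# An int has no text-encoding method. It has to_bytes(), which asks for a
# length and a byte order rather than an encoding name such as 'utf-8'.
has_encode = hasattr(code_point, 'encode')
raw = code_point.to_bytes(2, 'big')       # the two bytes 0x40 0x5a
print(has_encode, raw)

# ESC seen three ways: as a str, as an element of a bytes object, as an int.
esc_values = (ord('\x1b'), b'\x1b'[0], 0x1b)
print(esc_values)
```

The ordinal does take part in the encoding: UTF-8 is a rule for turning the number 16474 into bytes, and str.encode() applies that rule to the ordinal of each character in the string.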
Re: A few questiosn about encoding
Nick, The only thing that i didn't understood is this line. First please tell me what is a byte value \x1b is a sequence you find inside strings (and byte strings, the b'...' format). \x1b is a character(ESC) represented in hex format b'\x1b' is a byte object that represents what? chr(27).encode('utf-8') b'\x1b' b'\x1b'.decode('utf-8') '\x1b' After decoding it gives the char ESC in hex format Shouldn't it result in value 27 which is the ordinal of ESC ? I'm sorry are you not listening? 1b is a HEXADECIMAL Number. As a so-called programmer, did you seriously not consider that? Try this: 1) Open a Web browser 2) Go to Google.com 3) Type in Hex 1B 4) Click on the first link 5) In the Hexadecimal column find 1B. Or open your favorite calculator, and convert Hexadecimal 1B to Decimal (Base 10). - Benjamin -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
: On 14 June 2013 01:34, Nick the Gr33k supp...@superhost.gr wrote: Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical? Think about it. Let's say that, as per your scheme, a leading 0 indicates 1 byte (as is indeed the case in UTF8). What things could follow that leading 0? How does that impact your choice of a leading 00 or 01 for other numbers of bytes? ... okay, you're obviously going to need to be spoon-fed a little more than that. Here's a byte: 01010101 Is that a single byte representing a code point in the 0-127 range, or the first of 4 bytes representing something else, in your proposed scheme? How can you tell? Now look at the way UTF8 does it: http://en.wikipedia.org/wiki/Utf-8#Description Really, follow the link and study the table carefully. Don't continue reading this until you believe you understand the choices that the designers of UTF8 made, and why they made them. Pay particular attention to the possible values for byte 1. Do you notice the difference between that scheme, and yours:
0xxxxxxx
1xxxxxxx
00xxxxxx
01xxxxxx
10xxxxxx
11xxxxxx
If you don't see it, keep looking until you do ... this email gives you more than enough hints to work it out. Don't ask someone here to explain it to you. If you want to become competent, you must use your brain. -[]z. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14/6/2013 4:00 AM, Cameron Simpson wrote: On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote: | A code-point and the code-point's ordinal value are associated into | a Unicode charset. They have the so called 1:1 mapping. | | So, i was under the impression that by encoding the code-point into | utf-8 was the same as encoding the code-point's ordinal value into | utf-8. | | So, now i believe they are two different things. | The code-point *is what actually* needs to be encoded and *not* its | ordinal value. Because there is a 1:1 mapping, these are the same thing: a code point is directly _represented_ by the ordinal value, and the ordinal value is encoded for storage as bytes. So, you are saying that:
chr(16474).encode('utf-8') # being the code-point encoded
ord(chr(16474)).encode('utf-8') # being the code-point's ordinal encoded, which gives an error.
That shows us that a character is what is being encoded to utf-8 but the character's ordinal cannot be. So, why do you say "and the ordinal value is encoded for storage as bytes"? | The leading 0b is just syntax to tell you this is base 2, not base 8 | (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped. | | But byte objects are represented as '\x' instead of the | aforementioned '0x'. Why is that? You're confusing a string representation of a single number in some base (eg 2 or 16) with the string-ish representation of a bytes object.
>>> bin(16474)
'0b100000001011010'
that is a binary format string representation of number 16474, yes?
>>> hex(16474)
'0x405a'
that is a hexadecimal format string representation of number 16474, yes? WHILE: b'abc\x1b\n' = a string representation of a byte, which in turn is a series of integers, so that makes this a string representation of integers, is this correct? \x1b = ESC character \ = for separating bytes x = to flag that the following bytes are going to be represented as hex values? What exactly does 'x' mean here? character perhaps?
Still it's not clear in my head what the difference between '0x1b' and '\x1b' is. I think: 0x1b = an integer represented in hex format, \x1b = a character represented in hex format. Is this true? | How can i view this byte's object representation as hex() or as bin()? See above. A bytes is a _sequence_ of values. hex() and bin() format individual values in hexadecimal or binary respectively.
>>> for value in b'\x97\x98\x99\x27\x10':
...     print(value, hex(value), bin(value))
...
151 0x97 0b10010111
152 0x98 0b10011000
153 0x99 0b10011001
39 0x27 0b100111
16 0x10 0b10000
>>> for value in b'abc\x1b\n':
...     print(value, hex(value), bin(value))
...
97 0x61 0b1100001
98 0x62 0b1100010
99 0x63 0b1100011
27 0x1b 0b11011
10 0xa 0b1010
Why do these two give different values when printed? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
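Iterating over a bytes object yields plain integers, and hex() and bin() drop leading zeros, which is why the columns in such a printout have uneven widths. A short sketch with fixed-width formatting, assuming Python 3; the names data and rows are illustrative:

```python
data = b'abc\x1b\n'

# Each element of a bytes object is an int in the range 0..255.
values = list(data)
print(values)

# format() with a width pads with zeros, so every byte shows all 8 bits.
rows = [(value, format(value, '#04x'), format(value, '08b')) for value in data]
for row in rows:
    print(*row)

# bin() and hex() never pad: 16 has five significant binary digits.
print(bin(16), hex(10))
```

The two byte strings in the loops print different values simply because they contain different bytes; the notation used for each value does not change the value itself.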
Re: A few questiosn about encoding
On 14/6/2013 9:00 AM, Zero Piraeus wrote: : On 14 June 2013 01:34, Nick the Gr33k supp...@superhost.gr wrote: Why doesn't it work like this?
leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag
Wouldn't it be more logical? Think about it. Let's say that, as per your scheme, a leading 0 indicates 1 byte (as is indeed the case in UTF8). What things could follow that leading 0? How does that impact your choice of a leading 00 or 01 for other numbers of bytes? ... okay, you're obviously going to need to be spoon-fed a little more than that. Here's a byte: 01010101 Is that a single byte representing a code point in the 0-127 range, or the first of 4 bytes representing something else, in your proposed scheme? How can you tell? Indeed. You cannot tell if it stands for a 1-byte or a 4-byte sequence:
0 + 1010101 = leading 0 stands for 1-byte representation of a code-point
01 + 010101 = leading 01 stands for 4-byte representation of a code-point
The problem here in my scheme of how utf8 encoding works is that you cannot tell whether the flag is '0' or '01'. The same happens with leading '1' and '11'. You cannot tell what the flag is, so you cannot know if the Unicode code-point is being represented as a 2-byte sequence or a 6-byte sequence. Understood. Now look at the way UTF8 does it: http://en.wikipedia.org/wiki/Utf-8#Description Really, follow the link and study the table carefully. Don't continue reading this until you believe you understand the choices that the designers of UTF8 made, and why they made them. Pay particular attention to the possible values for byte 1. Do you notice the difference between that scheme, and yours:
0xxxxxxx
1xxxxxxx
00xxxxxx
01xxxxxx
10xxxxxx
11xxxxxx
If you don't see it, keep looking until you do ... this email gives you more than enough hints to work it out. Don't ask someone here to explain it to you. If you want to become competent, you must use your brain.
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
I did read the link but I still cannot see: 1. why '110' is the flag for a 2-byte code-point 2. why in the 2nd byte and every subsequent byte the leading flag has to be '10' -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
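One way to see why the UTF-8 prefixes work is that none of 0, 10, 110, 1110 and 11110 is the start of another, so the leading bits of any byte identify its role without ambiguity. The sketch below is not from the thread: byte_role and utf8_encode are hypothetical helper names, and the hand-written encoder skips validity checks (for example surrogates) for brevity. It is compared against Python's own encoder.

```python
def byte_role(b):
    """Classify one byte value by its leading bits."""
    if b >> 7 == 0b0:
        return 'single'          # 0xxxxxxx
    if b >> 6 == 0b10:
        return 'continuation'    # 10xxxxxx
    if b >> 5 == 0b110:
        return 'lead-2'          # 110xxxxx
    if b >> 4 == 0b1110:
        return 'lead-3'          # 1110xxxx
    if b >> 3 == 0b11110:
        return 'lead-4'          # 11110xxx
    return 'invalid'

def utf8_encode(cp):
    """Encode one code point by hand (no validity checks)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

roles = [byte_role(b) for b in chr(16474).encode('utf-8')]
print(roles)
print(byte_role(0b01010101))
```

Because continuation bytes always start with 10 and no lead byte does, a decoder that lands in the middle of a sequence can step backwards to the nearest byte that is not a continuation and resume from there.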
Re: A few questiosn about encoding
On 13-06-13 10:08, Νικόλαος Κούρας wrote: On 13/6/2013 10:58 AM, Chris Angelico wrote: On Thu, Jun 13, 2013 at 5:42 PM, supp...@superhost.gr wrote: On 13/6/2013 10:11 AM, Steven D'Aprano wrote: No! That creates a string from 16474 in base two: '0b100000001011010' I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number? You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string. Indeed python embraced it in single quoting '0b100000001011010' and not as 0b100000001011010 which in fact makes it a string. But since bin(16474) seems to create a string rather than an expected number (at least in my mind) then how do we get the binary representation of the number 16474 as a number? You don't. You should remember that python (or any programming language) doesn't print numbers. It always prints string representations of numbers. It is just that we are so used to the decimal representation that we think of that representation as being the number. Normally that is not a problem but it can cause confusion when you are working with multiple representations. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
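The point that bin() returns a string, while a binary literal is just another way of writing an int, can be verified directly. A minimal sketch, assuming Python 3; the variable names are illustrative:

```python
n = 16474

text = bin(n)                      # a str holding the binary notation of n
print(type(text).__name__, text)

literal = 0b100000001011010        # binary literal: the very same int as 16474
back = int(text, 2)                # parse the notation back into an int

print(literal == n, back == n)
print(literal)                     # print() writes ints in decimal notation
```

There is no separate "binary number" object: the int is the same whichever notation was used to write or display it.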
Re: A few questiosn about encoding
On 14/6/2013 10:36 AM, Antoon Pardon wrote: On 13-06-13 10:08, Νικόλαος Κούρας wrote: On 13/6/2013 10:58 AM, Chris Angelico wrote: On Thu, Jun 13, 2013 at 5:42 PM, supp...@superhost.gr wrote: On 13/6/2013 10:11 AM, Steven D'Aprano wrote: No! That creates a string from 16474 in base two: '0b100000001011010' I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number? You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string. Indeed python embraced it in single quoting '0b100000001011010' and not as 0b100000001011010 which in fact makes it a string. But since bin(16474) seems to create a string rather than an expected number (at least in my mind) then how do we get the binary representation of the number 16474 as a number? You don't. You should remember that python (or any programming language) doesn't print numbers. It always prints string representations of numbers. It is just that we are so used to the decimal representation that we think of that representation as being the number. Normally that is not a problem but it can cause confusion when you are working with multiple representations. Hold on! You are basically saying here that:
>>> 16474
16474
is not a number as we think but instead is a string representation of a number? I don't think so; if it were a string representation of a number, that would print the following:
>>> 16474
'16474'
Python prints numbers:
>>> 16474
16474
>>> 0b100000001011010
16474
>>> 0x405a
16474
It prints them all in decimal format though. But when we need a decimal integer to be turned into binary or hex we can call bin(number) or hex(number) and just remove the pair of single quotes. -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14-06-13 09:49, Nick the Gr33k wrote: On 14/6/2013 10:36 AM, Antoon Pardon wrote: On 13-06-13 10:08, Νικόλαος Κούρας wrote: Indeed python embraced it in single quoting '0b100000001011010' and not as 0b100000001011010 which in fact makes it a string. But since bin(16474) seems to create a string rather than an expected number (at least in my mind) then how do we get the binary representation of the number 16474 as a number? You don't. You should remember that python (or any programming language) doesn't print numbers. It always prints string representations of numbers. It is just that we are so used to the decimal representation that we think of that representation as being the number. Normally that is not a problem but it can cause confusion when you are working with multiple representations. Hold on! You are basically saying here that: 16474 16474 is not a number as we think but instead is a string representation of a number? Yes, or if you prefer, what python prints is the decimal notation of the number. I don't think so; if it were a string representation of a number, that would print the following: 16474 '16474' No it wouldn't. You are confusing representation in the everyday meaning with representation as python jargon. Python prints numbers: No it doesn't. Numbers are abstract concepts that can be represented in various notations; these notations are strings. Those notational strings end up being printed. As I said before, we are so used to the decimal notation that we often use the notation and the number interchangeably without a problem. But when we are working with multiple notations that can become confusing, and we should be careful to separate numbers from their representations/notations. but when we need a decimal integer There are no decimal integers. There is only a decimal notation of the number. Decimal, octal etc. are not characteristics of the numbers themselves. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14/6/2013 11:22 πμ, Antoon Pardon wrote: Python prints numbers: No it doesn't, numbers are abstract concepts that can be represented in various notations, these notations are strings. Those notaional strings end up being printed. As I said before we are so used in using the decimal notation that we often use the notation and the number interchangebly without a problem. But when we are working with multiple notations that can become confusing and we should be careful to seperate numbers from their representaions/notations. How do we separate a number then from its represenation-natation? What is a notation anywat? is it a way of displayment? but that would be a represeantion then Please explain this line as it uses both terms. No it doesn't, numbers are abstract concepts that can be represented in various notations but when we need a decimal integer There are no decimal integers. There is only a decimal notation of the number. Decimal, octal etc are not characteristics of the numbers themselves. So everything we see like: 16474 nikos abc123 everything is a string and nothing is a number? not even number 1? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Don't feed the troll... (was: Re: A few questiosn about encoding)
On 14.06.2013 10:37, Nick the Gr33k wrote: So everything we see like: 16474 nikos abc123 everything is a string and nothing is a number? not even number 1? Come on now, this is _so_ obviously trolling, it's not even remotely funny anymore. Why doesn't killfiling work with the mailing list version of the python list? :-( -- --- Heiko. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14Jun2013 11:37, Nikos as SuperHost Support supp...@superhost.gr wrote: | On 14/6/2013 11:22 πμ, Antoon Pardon wrote: | | Python prints numbers: | No it doesn't, numbers are abstract concepts that can be represented in | various notations, these notations are strings. Those notaional strings | end up being printed. As I said before we are so used in using the | decimal notation that we often use the notation and the number interchangebly | without a problem. But when we are working with multiple notations that | can become confusing and we should be careful to seperate numbers from their | representaions/notations. | | How do we separate a number then from its represenation-natation? Shrug. When you print a number, Python transcribes a string representation of it to your terminal. | What is a notation anywat? is it a way of displayment? but that | would be a represeantion then Yep. Same thing. A notation is a particulart formal method of representation. | No it doesn't, numbers are abstract concepts that can be represented in | various notations | | but when we need a decimal integer | | There are no decimal integers. There is only a decimal notation of the number. | Decimal, octal etc are not characteristics of the numbers themselves. | | So everything we see like: | | 16474 | nikos | abc123 | | everything is a string and nothing is a number? not even number 1? Everything you see like that is textual information. Internally to Python, various types are used: strings, bytes, integers etc. But when you print something, text is output. Cheers, -- Cameron Simpson c...@zip.com.au A long-forgotten loved one will appear soon. Buy the negatives at any price. -- http://mail.python.org/mailman/listinfo/python-list
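To make the number-versus-notation distinction concrete, here is a small sketch (assuming Python 3; the dictionary name notations is illustrative) that writes one int in several notations and parses each back to the same value:

```python
n = 16474

# One number, several textual notations. Every value in this dict is a str.
notations = {
    10: str(n),
    2: format(n, 'b'),
    8: format(n, 'o'),
    16: format(n, 'x'),
}
for base, text in notations.items():
    print(base, text)

# Parsing any of the notations with its base gives back the same int.
parsed = [int(text, base) for base, text in notations.items()]
print(parsed)
```

What reaches the terminal is always text; the int object inside the interpreter has no base of its own.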
Re: Don't feed the troll... (was: Re: A few questiosn about encoding)
On 14 Jun 2013 10:20, Heiko Wundram modeln...@modelnine.org wrote: Am 14.06.2013 10:37, schrieb Nick the Gr33k: So everything we see like: 16474 nikos abc123 everything is a string and nothing is a number? not even number 1? Come on now, this is _so_ obviously trolling, it's not even remotely funny anymore. Why doesn't killfiling work with the mailing list version of the python list? :-( I have skimmed the archives for this month, and I estimate that a third of this month's activity on this list was helping this person. About 80% of that is wasted in explaining basic concepts he refuses to read in links given to him. A depressingly large number of replies to his posts are seemingly ignored. Since this is a lot of spam, I feel like leaving the list, but I also honestly want to help people use python and the replies to questions of others often give me much insight on several matters. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14Jun2013 09:59, Nikos as SuperHost Support supp...@superhost.gr wrote: | On 14/6/2013 4:00 AM, Cameron Simpson wrote: | On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote: | | A code-point and the code-point's ordinal value are associated into | | a Unicode charset. They have the so called 1:1 mapping. | | | | So, i was under the impression that by encoding the code-point into | | utf-8 was the same as encoding the code-point's ordinal value into | | utf-8. | | | | So, now i believe they are two different things. | | The code-point *is what actually* needs to be encoded and *not* its | | ordinal value. | | Because there is a 1:1 mapping, these are the same thing: a code | point is directly _represented_ by the ordinal value, and the ordinal | value is encoded for storage as bytes. | | So, you are saying that: | | chr(16474).encode('utf-8') #being the code-point encoded | | ord(chr(16474)).encode('utf-8') #being the code-point's ordinal | encoded which gives an error. | | that shows us that a character is what is being encoded to utf-8 | but the character's ordinal cannot. | | So, why do you say "and the ordinal value is encoded for storage | as bytes"? No, I mean conceptually, there is no difference between a codepoint and its ordinal value. They are the same thing. Inside Python itself, a character (a string of length 1; there is no separate character type) is a distinct type. Internally, the characters in a string are stored numerically. As Unicode codepoints, as their ordinal values. It is a meaningful idea to store a Python string encoded into bytes using some text encoding scheme (utf-8, iso-8859-7, what have you). It is not a meaningful thing to store a number encoded without some more context. The .encode() method that accepts an encoding name like utf-8 is specifically an encoding procedure FOR TEXT. So strings have such a method, and integers do not.
When you write: chr(16474) you receive a _string_, containing the single character whose ordinal is 16474. It is meaningful to transcribe this string to bytes using a text encoding procedure like 'utf-8'. When you write: ord(chr(16474)) you get an integer. Because ord() is the reverse of chr(), you get the integer 16474. Integers do not have .encode() methods that accept a _text_ encoding name like 'utf-8' because integers are not text. | | The leading 0b is just syntax to tell you this is base 2, not base 8 | | (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped. | | | | But byte objects are represented as '\x' instead of the | | aforementioned '0x'. Why is that? | | You're confusing a string representation of a single number in | some base (eg 2 or 16) with the string-ish representation of a | bytes object. | | bin(16474) | '0b100000001011010' | that is a binary format string representation of number 16474, yes? Yes. | hex(16474) | '0x405a' | that is a hexadecimal format string representation of number 16474, yes? Yes. | WHILE: | b'abc\x1b\n' = a string representation of a byte, which in turn is a | series of integers, so that makes this a string representation of | integers, is this correct? A bytes Python object. So not a byte, 5 bytes. It is a string representation of the series of byte values, ON THE PREMISE that the bytes may well represent text. On that basis, b'abc\x1b\n' is a reasonable way to display them. In other contexts this might not be a sensible way to display these bytes, and then another format would be chosen, possibly hand constructed by the programmer, or equally reasonable, the hexlify() function from the binascii module. | \x1b = ESC character Considering the bytes to be representing characters, then yes. | \ = for separating bytes No, \ to introduce a sequence of characters with special meaning. Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value.
But several characters are hard or confusing to place literally in a b'...' string. For example a newline character or an escape character. 'a' means 97. '\n' means 10 (newline, hence the 'n'). '\x1b' means 27 (escape, value 0x1b in hexadecimal). And, of course, '\\' means a literal slosh, value 92. | x = to flag that the following bytes are going to be represented as | hex values? whats exactly 'x' means here? character perhaps? A slosh followed by an 'x' means there will be 2 hexadecimal digits to follow, and those two digits represent the byte value. So, yes. | Still its not clear into my head what the difference of '0x1b' and | '\x1b' is: They're the same thing in two similar but slightly different formats. 0x1b is a legitimate bare integer value in Python. \x1b is a sequence you find inside strings (and byte strings, the b'...' format). | i think: | 0x1b = an integer represented in hex format Yes. | \x1b = a character represented in hex format Yes. | | How can i view this byte's object representation as
Re: A few questiosn about encoding
Op 14-06-13 10:37, Nick the Gr33k schreef: On 14/6/2013 11:22 πμ, Antoon Pardon wrote: Python prints numbers: No it doesn't, numbers are abstract concepts that can be represented in various notations, these notations are strings. Those notaional strings end up being printed. As I said before we are so used in using the decimal notation that we often use the notation and the number interchangebly without a problem. But when we are working with multiple notations that can become confusing and we should be careful to seperate numbers from their representaions/notations. How do we separate a number then from its represenation-natation? What do you mean? Internally there is no representation linked to the number, so there is nothing to be seperated. Only when a number needs to be printed, is a representation for that number built and displayed. What is a notation anywat? is it a way of displayment? but that would be a represeantion then Yes a notation is a representation. However represenation is also a bit of python jargon that has a specific meaning. So in order to not confuse with multiple possible meanings for representation I chose to use notation There are no decimal integers. There is only a decimal notation of the number. Decimal, octal etc are not characteristics of the numbers themselves. So everything we see like: 16474 nikos abc123 everything is a string and nothing is a number? not even number 1? There is a difference between everything we see as you write earlier and just plain eveything as you write later. Python works with numbers, but at the moment it has to display such a number it has to produce something that is printable. So it will build a string that can be used as a notation for that number, a numeral. And that is what will be displayed. -- Antoon. -- http://mail.python.org/mailman/listinfo/python-list
Re: Don't feed the troll... (was: Re: A few questiosn about encoding)
On Jun 14, 3:20 pm, Fábio Santos fabiosantos...@gmail.com wrote: Come on now, this is _so_ obviously trolling, it's not even remotely funny anymore. Why doesn't killfiling work with the mailing list version of the python list? :-( I have skimmed the archives for this month, and I estimate that a third of this month's activity on this list was helping this person. About 80% of that is wasted in explaining basic concepts he refuses to read in links given to him. A depressingly large number of replies to his posts are seemingly ignored. Since this is a lot of spam, I feel like leaving the list, but I also honestly want to help people use python and the replies to questions of others often give me much insight on several matters. Adding my +1 to this sentiment. In older saner and more politically incorrect times, when there was a student who was as idiotic as Nikos, he would be made to: -- run five rounds of the field -- stay after school -- write pages of I shall not talk in class In the age of cut-n-paste the last has lost its sting. Likewise the first two are hard to administer across the internet. Still if we are genuinely interested in solving this problem, ways may be found, for example: Any question from Nikos that has any English error, should be returned with: Correct your English before we look at your python. If he is brazen enough to correct one error and leave the other 35, then we put in a 24-hour delay for each reply. I am sure others can come up with better solutions if we wish. The alternative is that this disease has an unfavorable prognosis: [Yes Nikos is an infectious disease: I believe I can pull out mails from Steven and Grant Edwards whic hare begng tolook sspcicious ly like Nikos [Sorry Im not much good at imitation!] ] And that unfavorable prognosis is what Fabio is suggesting -- people will start leaving the list/group. Nikos: This is not against you personally. Just your current mode of conduct towards this list. 
And that mode quite simply is this: You have no interest in python, you are only interested in the immediate questions of your web-hosting. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14/6/2013 1:19 μμ, Cameron Simpson wrote: On 14Jun2013 11:37, Nikos as SuperHost Support supp...@superhost.gr wrote: | On 14/6/2013 11:22 πμ, Antoon Pardon wrote: | | Python prints numbers: | No it doesn't, numbers are abstract concepts that can be represented in | various notations, these notations are strings. Those notaional strings | end up being printed. As I said before we are so used in using the | decimal notation that we often use the notation and the number interchangebly | without a problem. But when we are working with multiple notations that | can become confusing and we should be careful to seperate numbers from their | representaions/notations. | | How do we separate a number then from its represenation-natation? Shrug. When you print a number, Python transcribes a string representation of it to your terminal. 16 16 So the output 16 is in fact a string representation of the number 16 ? Then in what 16 and '16; differ to? | What is a notation anywat? is it a way of displayment? but that | would be a represeantion then Yep. Same thing. A notation is a particulart formal method of representation. Can you elaborate please? | No it doesn't, numbers are abstract concepts that can be represented in | various notations | | but when we need a decimal integer | | There are no decimal integers. There is only a decimal notation of the number. | Decimal, octal etc are not characteristics of the numbers themselves. | | So everything we see like: | | 16474 | nikos | abc123 | | everything is a string and nothing is a number? not even number 1? Everything you see like that is textual information. Internally to Python, various types are used: strings, bytes, integers etc. But when you print something, text is output. Cheers, Thanks! -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
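The difference between 16 and '16' asked about above can be seen directly at the interpreter. What follows is only a small sketch, assuming Python 3; the names n and s are made up for the illustration.

```python
n = 16      # an int: the number itself
s = '16'    # a str: two characters, '1' and '6'

print(type(n).__name__, type(s).__name__)   # int str
print(n == s)                               # False: a number is not its notation
print(n + 1)                                # 17   (arithmetic on the number)
print(s + '1')                              # 161  (concatenation of text)
print(str(n) == s, int(s) == n)             # True True: converting between the two
print(repr(n), repr(s))                     # 16 '16'
```

Both print(n) and print(s) put the same two characters on the terminal, which is why the two are easy to confuse; repr() shows which one is the string.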
Re: A few questiosn about encoding
On 14/6/2013 1:50 μμ, Antoon Pardon wrote: Python works with numbers, but at the moment it has to display such a number it has to produce something that is printable. So it will build a string that can be used as a notation for that number, a numeral. And that is what will be displayed. so a number is just a number but when this number needs to be displayed into a monitor, then the printed form of that number we choose to call it a numeral? So, a numeral = a string representation of a number. Is this correct? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14-06-13 14:59, Nick the Gr33k wrote:
> On 14/6/2013 1:50 PM, Antoon Pardon wrote:
> > Python works with numbers, but at the moment it has to display such a number it has to produce something that is printable. So it will build a string that can be used as a notation for that number, a numeral. And that is what will be displayed.
>
> so a number is just a number but when this number needs to be displayed into a monitor, then the printed form of that number we choose to call it a numeral? So, a numeral = a string representation of a number. Is this correct?

Yes, when you print an integer, what actually happens is something along the following algorithm (python 2 code):

    def write_int(out, nr):
        ord0 = ord('0')
        lst = []
        negative = False
        if nr < 0:
            negative = True
            nr = -nr
        while nr:
            # peel off the least significant digit
            digit = nr % 10
            lst.append(chr(digit + ord0))
            nr /= 10          # integer division in python 2
        if negative:
            lst.append('-')
        lst.reverse()
        if not lst:
            lst.append('0')   # nr was zero
        numeral = ''.join(lst)
        out.write(numeral)

-- 
Antoon Pardon
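For readers on Python 3, the same idea can be sketched as below. This is an illustrative port, not Antoon's code: it assumes Python 3, uses divmod() in place of the Python 2 `/=`, and writes into an io.StringIO buffer so the resulting numeral can be inspected.

```python
import io

def write_int(out, nr):
    """Write the decimal numeral for the integer nr to the file-like object out."""
    digits = []
    negative = nr < 0
    if negative:
        nr = -nr
    while nr:
        nr, digit = divmod(nr, 10)            # peel off the least significant digit
        digits.append(chr(ord('0') + digit))  # turn the digit value into a character
    if not digits:
        digits.append('0')                    # zero yields no digits in the loop
    if negative:
        digits.append('-')
    digits.reverse()                          # digits were collected backwards
    out.write(''.join(digits))

buf = io.StringIO()
write_int(buf, -16474)
numeral = buf.getvalue()
print(repr(numeral), numeral == str(-16474))  # '-16474' True
```

The comparison with str() is the point: the built-in conversion produces the same string of characters that this loop builds by hand.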
Re: A few questiosn about encoding
On 14/6/2013 1:14 PM, Cameron Simpson wrote:
> Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value.

The only thing that I didn't understand is this line. First, please tell me what a byte value is.

> \x1b is a sequence you find inside strings (and byte strings, the b'...' format).

\x1b is a character (ESC) represented in hex format.
b'\x1b' is a bytes object that represents what?

    >>> chr(27).encode('utf-8')
    b'\x1b'
    >>> b'\x1b'.decode('utf-8')
    '\x1b'

After decoding it gives the char ESC in hex format. Shouldn't it result in the value 27, which is the ordinal of ESC?

> No, I mean conceptually, there is no difference between a code-point and its ordinal value. They are the same thing.

Why doesn't the Unicode charset just contain characters, instead of containing a mapping of (characters -- ordinals)? I mean, what we do is encode a character like chr(65).encode('utf-8'). What's the reason for the existence of its corresponding ordinal value, since it doesn't get involved in the encoding process?

Thank you very much for taking the time to explain.

-- 
What is now proved was at first only imagined!
Re: A few questiosn about encoding
let's cut to the chase and start with telling us what you DO know Nick. That would take less typing

On Fri, Jun 14, 2013 at 9:58 AM, Nick the Gr33k supp...@superhost.gr wrote:
> On 14/6/2013 1:14 PM, Cameron Simpson wrote:
> > Normally a character in a b'...' item represents the byte value matching the character's Unicode ordinal value.
>
> The only thing that I didn't understand is this line. First, please tell me what a byte value is.
>
> > \x1b is a sequence you find inside strings (and byte strings, the b'...' format).
>
> \x1b is a character (ESC) represented in hex format.
> b'\x1b' is a bytes object that represents what?
>
>     >>> chr(27).encode('utf-8')
>     b'\x1b'
>     >>> b'\x1b'.decode('utf-8')
>     '\x1b'
>
> After decoding it gives the char ESC in hex format. Shouldn't it result in the value 27, which is the ordinal of ESC?
>
> > No, I mean conceptually, there is no difference between a code-point and its ordinal value. They are the same thing.
>
> Why doesn't the Unicode charset just contain characters, instead of containing a mapping of (characters -- ordinals)? I mean, what we do is encode a character like chr(65).encode('utf-8'). What's the reason for the existence of its corresponding ordinal value, since it doesn't get involved in the encoding process?
>
> Thank you very much for taking the time to explain.
>
> --
> What is now proved was at first only imagined!

-- 
Joel Goldstick
http://joelgoldstick.com
Re: A few questiosn about encoding
On 14/6/2013 6:21 μμ, Joel Goldstick wrote: let's cut to the chase and start with telling us what you DO know Nick. That would take less typing Well, my biggest successes up until now where to build 3 websites utilizing database saves and retrievals in PHP in Perl and later in Python with absolute ignorance of Apache Configuration: CGI: Linux: with just basic knowledge of linux. I'am very proud of it. -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Sat, Jun 15, 2013 at 1:26 AM, Nick the Gr33k supp...@superhost.gr wrote: Well, my biggest successes up until now where to build 3 websites utilizing database saves and retrievals in PHP in Perl and later in Python with absolute ignorance of Apache Configuration: CGI: Linux: with just basic knowledge of linux. I'am very proud of it. Translation: I just built a car. I don't know anything about internal combustion engines or road rules or metalwork, and I'm very proud of the monstrosity that I'm now selling to my friends. Would you buy a car built by someone who proudly announces that he has no clue how to build one? Why do you sell web hosting services when you have no clue how to provide them? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Don't feed the troll... (was: Re: A few questiosn about encoding)
On Fri, 14 Jun 2013 11:06:55 +0200 Heiko Wundram modeln...@modelnine.org wrote: Come on now, this is _so_ obviously trolling, it's not even remotely funny anymore. Why doesn't killfiling work with the mailing list version of the python list? :-( A big problem, other than Mr. Support's shenanigans with his email address, is that even those of us who seem to have successfully *plonked* him get the responses to him. The biggest issue with a troll isn't so much the annoying emails from him but the amplified slew of responses. That's the point of a troll after all. The answer is to always make sure that you include the previous poster in the reply as a Cc or To. I filter out any email that has the string supp...@superhost.gr in a header so I would also filter out the replies if people would follow that simple rule. I have suggested this before but the push back I get is that then people would get two copies of the email, one to them and one to the list. My answer is simple. Get a proper email system that filters out duplicates. Is there an email client out there that does not have this facility? -- D'Arcy J.M. Cain da...@druid.net | Democracy is three wolves http://www.druid.net/darcy/| and a sheep voting on +1 416 788 2246 (DoD#0082)(eNTP) | what's for dinner. IM: da...@vex.net, VOIP: sip:da...@vex.net -- http://mail.python.org/mailman/listinfo/python-list
Re: Don't feed the troll... (was: Re: A few questiosn about encoding)
On Sat, Jun 15, 2013 at 3:13 AM, D'Arcy J.M. Cain da...@druid.net wrote: The answer is to always make sure that you include the previous poster in the reply as a Cc or To. I filter out any email that has the string supp...@superhost.gr in a header so I would also filter out the replies if people would follow that simple rule. I have suggested this before but the push back I get is that then people would get two copies of the email, one to them and one to the list. My answer is simple. Get a proper email system that filters out duplicates. Is there an email client out there that does not have this facility? The main downside to that is not the first response, to somebody@somewhere and python-list, but the subsequent ones. Do you include everyone's addresses? And if so, how do they then get off the list? (This is a serious consideration. I had some very angry people asking me to unsubscribe them from a (private) mailman list I run, but they weren't subscribed at all - they were being cc'd.) I prefer to simply mail the list. You should be able to mute entire threads, and he doesn't start more than a couple a day usually. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Don't feed the troll... (was: Re: A few questiosn about encoding)
On 2013-06-14, Chris Angelico ros...@gmail.com wrote:
> On Sat, Jun 15, 2013 at 3:13 AM, D'Arcy J.M. Cain da...@druid.net wrote:
> > The answer is to always make sure that you include the previous poster in the reply as a Cc or To. I filter out any email that has the string supp...@superhost.gr in a header so I would also filter out the replies if people would follow that simple rule. I have suggested this before but the push back I get is that then people would get two copies of the email, one to them and one to the list. My answer is simple. Get a proper email system that filters out duplicates. Is there an email client out there that does not have this facility?
>
> The main downside to that is not the first response, to somebody@somewhere and python-list, but the subsequent ones. Do you include everyone's addresses? And if so, how do they then get off the list? (This is a serious consideration. I had some very angry people asking me to unsubscribe them from a (private) mailman list I run, but they weren't subscribed at all - they were being cc'd.)

I think the answer is to automatically kill all threads started by him. Unfortunately, I don't know if that's possible in most newsreaders.

-- 
Grant Edwards               grant.b.edwards        Yow! A dwarf is passing out
                                  at               somewhere in Detroit!
                              gmail.com
Re: A few questiosn about encoding
On Sat, 15 Jun 2013 03:03:02 +1000, Chris Angelico wrote: Why do you sell web hosting services when you have no clue how to provide them? And why do you continue responding to this timewaster? Please, please just killfile him and let's all move on. -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14Jun2013 15:59, Nikos as SuperHost Support supp...@superhost.gr wrote: | So, a numeral = a string representation of a number. Is this correct? No, a numeral is an individual digit from the string representation of a number. So: 65 requires two numerals: '6' and '5'. -- Cameron Simpson c...@zip.com.au In life, you should always try to know your strong points, but this is far less important than knowing your weak points. Martin Fitzpatrick mfitzpatr...@scot.bbc.co.uk -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 14Jun2013 16:58, Nikos as SuperHost Support supp...@superhost.gr wrote:
| On 14/6/2013 1:14 PM, Cameron Simpson wrote:
| > Normally a character in a b'...' item represents the byte value
| > matching the character's Unicode ordinal value.
|
| The only thing that I didn't understand is this line.
| First, please tell me what a byte value is.

The numeric value stored in a byte. Bytes are just small integers in the range 0..255; the values available with 8 bits of storage.

| > \x1b is a sequence you find inside strings (and byte strings, the
| > b'...' format).
|
| \x1b is a character (ESC) represented in hex format.

Yes.

| b'\x1b' is a bytes object that represents what?

An array of 1 byte, whose value is 0x1b or 27.

| >>> chr(27).encode('utf-8')
| b'\x1b'

Transcribing the ESC Unicode character to byte storage.

| >>> b'\x1b'.decode('utf-8')
| '\x1b'

Reading a single byte array containing a 27 and decoding it assuming 'utf-8'. This obtains a single character string containing the ESC character.

| After decoding it gives the char ESC in hex format.
| Shouldn't it result in the value 27, which is the ordinal of ESC?

When printing strings, the non-printable characters in the string are _represented_ in hex format, so \x1b was printed.

| > No, I mean conceptually, there is no difference between a code-point
| > and its ordinal value. They are the same thing.
|
| Why doesn't the Unicode charset just contain characters, instead of
| containing a mapping of (characters -- ordinals)?

Look, as far as a computer is concerned a character and an ordinal are the same thing because you just store character ordinals in memory when you store a string.

When characters are _displayed_, your Terminal (or web browser or whatever) takes character ordinals and looks them up in a _font_, which is a mapping of character ordinals to glyphs (character images), and renders the character image onto your screen.

| I mean, what we do is encode a character like chr(65).encode('utf-8').
| What's the reason for the existence of its corresponding ordinal value,
| since it doesn't get involved in the encoding process?

Stop thinking of Unicode code points and ordinal values as separate things. They are effectively two terms for the same thing. So there is no corresponding ordinal value. 65 _is_ the ordinal value.

When you run:

    chr(65).encode('utf-8')

you're going:

    chr(65) == 'A'

  Producing a string with just one character in it. Internally, Python stores an array of character ordinals, thus: [65]

    'A'.encode('utf-8')

  Walk along all the ordinals in the string and transcribe them as bytes. For 65, the byte encoding in 'utf-8' is a single byte of value 65. So you get an array of bytes (a bytes object in Python), thus: [65]

-- 
Cameron Simpson c...@zip.com.au

The double cam chain setup on the 1980's DOHC CB750 was another one of Honda's pointless engineering breakthroughs. You know the cycle (if you'll pardon the pun :-), Wonderful New Feature is introduced with much fanfare, WNF is fawned over by the press, WNF is copied by the other three Japanese makers (this step is sometimes optional), and finally, WNF is quietly dropped by Honda.
        - Blaine Gardner, blgar...@sim.es.com
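The round trip described in this message can be followed step by step at a Python 3 prompt. The snippet below is only a sketch of that walk-through; the variable names are invented for the example.

```python
ch = chr(65)                        # code point 65 is the character 'A'
print(ch, ord(ch))                  # A 65

data = ch.encode('utf-8')           # one character becomes one byte
print(data, list(data))             # b'A' [65]

esc = chr(27)                       # the ESC character
esc_bytes = esc.encode('utf-8')
print(esc_bytes, list(esc_bytes))   # b'\x1b' [27]

decoded = esc_bytes.decode('utf-8') # back to a one-character string
print(repr(decoded), ord(decoded))  # '\x1b' 27
```

The last line shows both views at once: repr() displays the unprintable character with a \x escape, while ord() gives the number 27 that was being asked for.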
Re: A few questiosn about encoding
On 13/6/2013 3:13 AM, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> > So, how many bytes does UTF-8 stored for codepoints > 127 ?
>
> Two, three or four, depending on the codepoint.

The number of bytes needed by UTF-8 to store a code-point (character) depends on the ordinal value of the code-point in the Unicode charset, correct?

If this is correct, then the higher the ordinal value (which is a decimal integer) in the Unicode charset, the more bytes are needed for storage.

It's like the bigger a decimal integer is, the bigger the binary number it produces.

Is this correct?

> > example for codepoint 256, 1345, 16474 ?
>
> You can do this yourself. I have already given you enough information in previous emails to answer this question on your own, but here it is again:
>
> Open an interactive Python session, and run this code:
>
>     c = ord(16474)
>     len(c.encode('utf-8'))
>
> That will tell you how many bytes are used for that example.

This is actually wrong. ord()'s argument must be a character for which we expect its ordinal value.

    >>> chr(16474)
    '䁚'

Some Chinese symbol. So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Then encoding this glyph's ordinal value to binary gives us the following bytes:

    >>> bin(16474).encode('utf-8')
    b'0b100000001011010'

Now, we take two symbols out: the 'b' symbolism, which is there to tell us that we are looking at a bytes object, as well as the '0b' symbolism, which is there to tell us that we are looking at a binary representation of a bytes object.

Thus, we count 15 bits left. So it says 15 bits, which is 1 bit less than 2 bytes. Are the above statements correct please?

But thinking this through more and more:

    >>> chr(16474).encode('utf-8')
    b'\xe4\x81\x9a'
    >>> len(b'\xe4\x81\x9a')
    3

it seems that the bytestring the encode process produces is of length 3. So I take it that is 3 bytes?

But there is a mismatch between what bin(16474).encode('utf-8') and chr(16474).encode('utf-8') are telling us here.

Care to explain that too please?
Re: A few questiosn about encoding
On 12/6/2013 11:30 PM, Nobody wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> > So, how many bytes does UTF-8 stored for codepoints > 127 ?
>
>     U+0000..U+007F    1 byte
>     U+0080..U+07FF    2 bytes
>     U+0800..U+FFFF    3 bytes
>     >=U+10000         4 bytes

'U' stands for Unicode code-point, which means a character, right?

How can you tell up to what character utf-8 needs 1 byte or 2 bytes or 3?

And some of the bytes' bits are used to tell where a code-point's representation stops, right? I mean, if we have a code-point that needs 2 bytes to be stored, then the high bit must be set to 1 to signify that this character's encoding stops at 2 bytes.

I just know that 2^8 = 256; that's at first look 256 places, which means 256 positions to hold a code-point, which in turn means a character.

We take the high bit out and then we have 2^7, which is enough positions for 0-127 standard ASCII. The high bit is set to '0' to signify that the char is encoded in 1 byte.

Please tell me that I understood correctly so far.

But how about for 2 or 3 or 4 bytes?

Am I saying it correctly?
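The byte counts in the quoted table, and the role of the high bits asked about here, can be checked with a short Python 3 sketch. The helper utf8_len and the list of boundary code points are made up for this illustration; the bit patterns are simply printed rather than derived by hand.

```python
def utf8_len(code_point):
    """Return how many bytes UTF-8 uses for the given code point."""
    return len(chr(code_point).encode('utf-8'))

# First and last code point of each range in the table above.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    encoded = chr(cp).encode('utf-8')
    bits = ' '.join(format(b, '08b') for b in encoded)
    print('U+%04X' % cp, utf8_len(cp), bits)
```

In the printed bit patterns, a single-byte character starts with 0, the first byte of a 2-, 3- or 4-byte sequence starts with 110, 1110 or 11110 respectively, and every following byte starts with 10. So the length is announced by the first byte, not by a single high bit.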
Re: A few questiosn about encoding
--
UTF-8, Unicode (consortium): 1 to 4 *Unicode Transformation Unit*
UTF-8, ISO 10646: 1 to 6 *Unicode Transformation Unit*
(still current, unless really freshly modified)

jmf
Re: A few questiosn about encoding
On Thu, Jun 13, 2013 at 4:21 PM, Νικόλαος Κούρας supp...@superhost.gr wrote: How can you be able to tell up to what character utf-8 needs 1 byte or 2 bytes or 3? You look up Wikipedia, using the handy links that have been put to you MULTIPLE TIMES. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:
> On 13/6/2013 3:13 AM, Steven D'Aprano wrote:
> > Open an interactive Python session, and run this code:
> >
> >     c = ord(16474)
> >     len(c.encode('utf-8'))
> >
> > That will tell you how many bytes are used for that example.
>
> This is actually wrong. ord()'s argument must be a character for which we expect its ordinal value.

Gah! That's twice I've screwed that up. Sorry about that!

>     >>> chr(16474)
>     '䁚'
>
> Some Chinese symbol. So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.

> Then encoding this glyph's ordinal value to binary gives us the following bytes:
>
>     >>> bin(16474).encode('utf-8')
>     b'0b100000001011010'

No! That creates a string from 16474 in base two:

    '0b100000001011010'

The leading 0b is just syntax to tell you this is base 2, not base 8 (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.

Then you encode the string '0b100000001011010' into UTF-8. There are 17 characters in this string, and they are all ASCII characters so they take up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In hex form, they are:

    b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'

which takes up a lot more room, which is why Python prefers to show ASCII characters as characters rather than as hex.

What you want is:

    chr(16474).encode('utf-8')

[...]
> Thus, we count 15 bits left. So it says 15 bits, which is 1 bit less than 2 bytes. Are the above statements correct please?

No. There are 17 BYTES there. The string "0" doesn't get turned into a single bit. It still takes up a full byte, 0x30, which is 8 bits.

> But thinking this through more and more:
>
>     >>> chr(16474).encode('utf-8')
>     b'\xe4\x81\x9a'
>     >>> len(b'\xe4\x81\x9a')
>     3
>
> it seems that the bytestring the encode process produces is of length 3.

Correct! Now you have got the right idea.

-- 
Steven
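The two expressions being compared here can be run side by side. This is a small sketch assuming Python 3.5 or later (for bytes.hex()); the variable names are made up for the example.

```python
text = bin(16474)                     # a str describing the number in base 2
print(repr(text), len(text))          # '0b100000001011010' 17

as_bytes = text.encode('utf-8')       # encodes the 17 ASCII characters of that str
print(len(as_bytes))                  # 17
print(as_bytes.hex())                 # 3062313030303030303031303131303130

encoded = chr(16474).encode('utf-8')  # encodes the single character U+405A
print(encoded, len(encoded))          # b'\xe4\x81\x9a' 3
```

The first encode() call never sees the number 16474 at all, only the characters '0', 'b', '1' and so on, which is why it yields 17 bytes while the character itself needs 3.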
Re: A few questiosn about encoding
On 13/6/2013 10:11 AM, Steven D'Aprano wrote:
> >     >>> chr(16474)
> >     '䁚'
> >
> > Some Chinese symbol. So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
> > Then encoding this glyph's ordinal value to binary gives us the following bytes:
> >
> >     >>> bin(16474).encode('utf-8')
> >     b'0b100000001011010'

An observation here that I ask you to confirm as valid.

1. A code-point and the code-point's ordinal value are associated in a Unicode charset. They have the so-called 1:1 mapping.

So, I was under the impression that encoding the code-point into utf-8 was the same as encoding the code-point's ordinal value into utf-8.

That is why I tried bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8').

So, now I believe they are two different things. The code-point *is what actually* needs to be encoded and *not* its ordinal value.

> The leading 0b is just syntax to tell you this is base 2, not base 8 (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.

But byte objects are represented as '\x' instead of the aforementioned '0x'. Why is that?

> No! That creates a string from 16474 in base two:
>
>     '0b100000001011010'

I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number?

> Then you encode the string '0b100000001011010' into UTF-8. There are 17 characters in this string, and they are all ASCII characters so they take up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).

0b100000001011010 stands for a number in base 2 for me, not a string. Have I understood something wrong?
Re: A few questiosn about encoding
On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> On 13/6/2013 10:11 AM, Steven D'Aprano wrote:
> > No! That creates a string from 16474 in base two:
> >
> >     '0b100000001011010'
>
> I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number?

You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string.

ChrisA
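The point of fact is easy to verify. A minimal check, assuming Python 3, with made-up variable names:

```python
s = bin(16474)
print(type(s).__name__)       # str
print(s)                      # 0b100000001011010

n = int(s, 2)                 # parse the base-2 text back into a number
print(n, type(n).__name__)    # 16474 int
```

int(s, 2) accepts the '0b' prefix when the base is given as 2, so the string produced by bin() converts straight back to the original integer.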
Re: A few questiosn about encoding
On 13/6/2013 10:58 AM, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> > On 13/6/2013 10:11 AM, Steven D'Aprano wrote:
> > > No! That creates a string from 16474 in base two:
> > >
> > >     '0b100000001011010'
> >
> > I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number?
>
> You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string.

Indeed, python enclosed it in single quotes, '0b100000001011010', and not as 0b100000001011010, which in fact makes it a string.

But since bin(16474) seems to create a string rather than the expected number (at least in my mind), then how do we get the binary representation of the number 16474 as a number?
Re: A few questiosn about encoding
On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> On 13/6/2013 10:58 AM, Chris Angelico wrote:
> > On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> > > On 13/6/2013 10:11 AM, Steven D'Aprano wrote:
> > > > No! That creates a string from 16474 in base two:
> > > >
> > > >     '0b100000001011010'
> > >
> > > I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number?
> >
> > You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string.
>
> Indeed, python enclosed it in single quotes, '0b100000001011010', and not as 0b100000001011010, which in fact makes it a string.
>
> But since bin(16474) seems to create a string rather than the expected number (at least in my mind), then how do we get the binary representation of the number 16474 as a number?

In Python 2:

    16474

In Python 3, you have to fiddle around with ctypes, but broadly speaking, the same thing.

ChrisA
Re: A few questiosn about encoding
On 13/6/2013 11:20 AM, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> > On 13/6/2013 10:58 AM, Chris Angelico wrote:
> > > On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας supp...@superhost.gr wrote:
> > > > On 13/6/2013 10:11 AM, Steven D'Aprano wrote:
> > > > > No! That creates a string from 16474 in base two:
> > > > >
> > > > >     '0b100000001011010'
> > > >
> > > > I disagree here. 16474 is a number in base 10. Doing bin(16474) we get the binary representation of number 16474 and not a string. Why do you say we receive a string while python presents a binary number?
> > >
> > > You can disagree all you like. Steven cited a simple point of fact, one which can be verified in any Python interpreter. Nikos, you are flat wrong here; bin(16474) creates a string.
> >
> > Indeed, python enclosed it in single quotes, '0b100000001011010', and not as 0b100000001011010, which in fact makes it a string.
> >
> > But since bin(16474) seems to create a string rather than the expected number (at least in my mind), then how do we get the binary representation of the number 16474 as a number?
>
> In Python 2:
>
>     16474

Typing 16474 in an interactive session, both in python 2 and 3, gives back the number 16474, while we want the binary representation of the number 16474.
Re: A few questiosn about encoding
On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
> > The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but that's not UTF-8, that's UTF-8-plus-extra-codepoints.
>
> And a proper UTF-8 decoder will reject \xC0\x80 and \xed\xa0\x80, even though mathematically they would translate into U+0000 and U+D800 respectively. The UTF-16 *mechanism* is limited to no more than Unicode has currently used, but I'm left wondering if that's actually the other way around - that Unicode planes were deemed to stop at the point where UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8 specification, allowing for 31 bits. Later revisions of the standard imposed the UTF-16 limit on Unicode as a whole.
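Python 3's own strict UTF-8 codec can serve as an example of such a decoder. The sketch below assumes Python 3 and the default 'strict' error handler; the list name rejected is made up, and the exact wording of the error reason is left to the interpreter.

```python
rejected = []
for raw in (b'\xc0\x80', b'\xed\xa0\x80'):
    try:
        raw.decode('utf-8')           # strict decoding is the default
    except UnicodeDecodeError as exc:
        rejected.append(raw)
        print(raw, 'rejected:', exc.reason)

# The legitimate one-byte encoding of U+0000 is accepted.
print(repr(b'\x00'.decode('utf-8')))  # '\x00'
```

The first sequence is an overlong two-byte form of U+0000 and the second would decode to the surrogate U+D800; both are refused, while the ordinary single zero byte decodes without complaint.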
Re: A few questiosn about encoding
On Thu, 13 Jun 2013 12:41:41 +0300, Νικόλαος Κούρας wrote:
> > In Python 2:
> >
> >     16474
>
> Typing 16474 in an interactive session, both in python 2 and 3, gives back the number 16474, while we want the binary representation of the number 16474.

Python does not work that way. Ints *always* display in decimal. Regardless of whether you enter the decimal in binary:

    py> 0b100000001011010
    16474

octal:

    py> 0o40132
    16474

or hexadecimal:

    py> 0x405A
    16474

ints always display in decimal. The only way to display in another base is to build a string showing what the int would look like in a different base:

    py> hex(16474)
    '0x405a'

Notice that the return value of bin, oct and hex are all strings. If they were ints, then they would display in decimal, defeating the purpose!

-- 
Steven
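The same points can be gathered into one short script. This is a sketch assuming Python 3; format() with the 'b', 'o' and 'x' codes is shown as one more way to build such strings, which the post itself does not mention.

```python
# Four spellings of the same int; the base only matters in the source text.
values = [0b100000001011010, 0o40132, 0x405A, 16474]
print(values)                          # [16474, 16474, 16474, 16474]

n = 16474
print(bin(n), oct(n), hex(n))          # 0b100000001011010 0o40132 0x405a

# format() builds the same strings without the prefix, or padded to a width.
print(format(n, 'b'), format(n, 'o'), format(n, 'x'))  # 100000001011010 40132 405a
print(format(n, '#018b'))              # 0b0100000001011010
```

Every result in the last three lines is a str; the int itself has no base, only its written form does.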
Re: A few questiosn about encoding
On 13/6/2013 2:49 PM, Steven D'Aprano wrote:

Please confirm these are true statements:

A code-point and the code-point's ordinal value are associated in a Unicode charset. They have the so-called 1:1 mapping.

So, I was under the impression that encoding the code-point into utf-8 was the same as encoding the code-point's ordinal value into utf-8.

So, now I believe they are two different things. The code-point *is what actually* needs to be encoded and *not* its ordinal value.

> The leading 0b is just syntax to tell you this is base 2, not base 8 (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped.

But byte objects are represented as '\x' instead of the aforementioned '0x'. Why is that?

> ints always display in decimal. The only way to display in another base is to build a string showing what the int would look like in a different base:
>
>     py> hex(16474)
>     '0x405a'
>
> Notice that the return value of bin, oct and hex are all strings. If they were ints, then they would display in decimal, defeating the purpose!

Thank you, I didn't know that! Indeed it works like this.

To encode a number we have to turn it into a string first.

    >>> '16474'.encode('utf-8')
    b'16474'

That 'b' stands for bytes. How can I view this bytes object's representation as hex() or as bin()?

Also:

    >>> len('0b100000001011010')
    17

You said this string consists of 17 chars. Why does the leading syntax of '0b' count as bits as well? Shouldn't it be 15 bits instead of 17?
Re: A few questiosn about encoding
On 13Jun2013 17:19, Nikos as SuperHost Support supp...@superhost.gr wrote: | A code-point and the code-point's ordinal value are associated into | a Unicode charset. They have the so called 1:1 mapping. | | So, i was under the impression that by encoding the code-point into | utf-8 was the same as encoding the code-point's ordinal value into | utf-8. | | So, now i believe they are two different things. | The code-point *is what actually* needs to be encoded and *not* its | ordinal value. Because there is a 1:1 mapping, these are the same thing: a code point is directly _represented_ by the ordinal value, and the ordinal value is encoded for storage as bytes. | The leading 0b is just syntax to tell you this is base 2, not base 8 | (0o) or base 10 or base 16 (0x). Also, leading zero bits are dropped. | | But byte objects are represented as '\x' instead of the | aforementioned '0x'. Why is that? You're confusing a string representation of a single number in some base (eg 2 or 16) with the string-ish representation of a bytes object. The former is just notation for writing a number in different bases, eg: 27base 10 1bbase 16 33base 8 11011 base 2 A common convention, and the one used by hex(), oct() and bin() in Python, is to prefix the non-base-10 representations with 0x for base 16, 0o for base 8 (octal) and 0b for base 2 (binary): 27 0x1b 0o33 0b11011 This allows the human reader or a machine lexer to decide what base the number is written in, and therefore to figure out what the underlying numeric value is. Conversely, consider the bytes object consisting of the values [97, 98, 99, 27, 10]. In ASCII (and UTF-8 and the iso-8859-x encodings) these may all represent the characters ['a', 'b', 'c', ESC, NL]. So when printing a bytes object, which is a sequence of small integers representing values stored in bytes, it is compact to print: b'abc\x1b\n' which is ['a', 'b', 'c', chr(27), newline]. 
The slosh (\) is the common convention in C-like languages and many others for representing special characters not directly represented by themselves. So \\ for a slosh, \n for a newline and \x1b for character 27 (ESC). The bytes object is still just a sequence of integers, but because it is very common to have those integers represent text, and very common to have some text one wants represented as bytes in a direct 1:1 mapping, this compact text form is useful and readable. It is also legal Python syntax for making a small bytes object. To demonstrate that this is just a _representation_, run this at an interactive Python 3 prompt:

[ i for i in b'abc\x1b\n' ]
[97, 98, 99, 27, 10]

See? Just numbers. | To encode a number we have to turn it into a string first. | | 16474.encode('utf-8') | b'16474' | | That 'b' stand for bytes. Syntactic details. Read this: http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals | How can i view this byte's object representation as hex() or as bin()? See above. A bytes is a _sequence_ of values. hex() and bin() print individual values in hexadecimal or binary respectively. You could do this:

for value in b'16474':
    print(value, hex(value), bin(value))

Cheers, -- Cameron Simpson c...@zip.com.au Uhlmann's Razor: When stupidity is a sufficient explanation, there is no need to have recourse to any other. - Michael M. Uhlmann, assistant attorney general for legislation in the Ford Administration -- http://mail.python.org/mailman/listinfo/python-list
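A small sketch of the distinction drawn above, assuming Python 3: a bytes object is a sequence of small integers, while hex() and bin() each return a string describing one integer. The variable names are only for illustration.

```python
# A bytes object is a sequence of integers in range(256).
data = b'abc\x1b\n'
values = list(data)                 # the underlying numbers
as_hex = [hex(v) for v in data]     # one string per byte, base 16
as_bin = [bin(v) for v in data]     # one string per byte, base 2

# An int has no encode() method; turn it into text first.
encoded = str(16474).encode('utf-8')

# bin() returns a string; the '0b' prefix is part of that string,
# so len() counts it along with the binary digits.
binary_text = bin(16474)
digit_count = (16474).bit_length()

print(values)
print(as_hex)
print(encoded, binary_text, len(binary_text), digit_count)
```

The length of the string returned by bin() is always two more than the number of bits, because the '0b' prefix is counted as two ordinary characters.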
Re: A few questiosn about encoding
On 14/6/2013 1:46 πμ, Dennis Lee Bieber wrote: On Wed, 12 Jun 2013 09:09:05 + (UTC), ?? supp...@superhost.gr declaimed the following: (*) infact UTF8 also indicates the end of each character Up to a point. The initial byte encodes the length and the top few bits, but the subsequent octets aren’t distinguishable as final in isolation. 0x80-0xBF can all be either medial or final. So, the first high-bits are a directive that UTF-8 uses to know how many bytes each character is being represented as. 0-127 codepoints(characters) use 1 bit to signify they need 1 bit for storage and the rest 7 bits to actually store the character ? Not quite... The leading bit is a 0 - which means 0..127 are sent as-is, no manipulation. So, in utf-8, the leading bit which is a zero 0, its actually a flag to tell that the code-point needs 1 byte to be stored and the rest 7 bits is for the actual value of 0-127 code-points ? 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for storage and the rest 14 bits to actually store the character ? 128..255 -- in what encoding? These all have the leading bit with a value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is inherent in the specified encoding and they are sent as-is. So, latin-iso or greek-iso, the leading 0 is not a flag like it is in utf-8 encoding because latin-iso and greek-iso and all *-iso use all 8 bits for storage? But, in utf-8, the leading bit, which is 1, is to tell that the code-point needs 2 byte to be stored and the rest 7 bits is for the actual value of 128-255 code-points ? But why 2 bytes? leading 1 is a flag and the rest 7 bits can hold the encoded value. Bu that is not the case since we know that utf-8 needs 2 bytes to store code-points 127-255 1110 starts a three byte sequence, 0 starts a four byte sequence... 
Basically, count the number of leading 1-bits before a 0 bit, and that tells you how many bytes are in the multi-byte sequence -- and all bytes that start with 10 are supposed to be the continuations of a multibyte set (and not a signal that this is a 1-byte entry -- those only have a leading 0) Why doesn't it work like this? leading 0 = 1 byte flag leading 1 = 2 bytes flag leading 00 = 3 bytes flag leading 01 = 4 bytes flag leading 10 = 5 bytes flag leading 11 = 6 bytes flag Wouldn't it be more logical? Original UTF-8 allowed for 31-bits to specify a character in the Unicode set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each continuation. utf8 6 byted = 48 bits - 7 bits(from first bytes) - 2 bits(for each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual code-point. But 2^31 is still a huge number to store any kind of character isnt it? -- What is now proved was at first only imagined! -- http://mail.python.org/mailman/listinfo/python-list
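The "count the leading 1-bits" rule quoted above can be sketched as follows. This is only an illustration for modern UTF-8 (at most 4 bytes per character); the function name utf8_sequence_length is made up for this example, and it inspects the bit pattern only, without validating the rest of the sequence.

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if lead_byte < 0x80:        # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead_byte < 0xC0:        # 10xxxxxx: continuation byte, never a lead byte
        raise ValueError("continuation byte, not the start of a character")
    if lead_byte < 0xE0:        # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead_byte < 0xF0:        # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead_byte < 0xF8:        # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError("not a valid lead byte in modern UTF-8")


for ch in ('a', '\u00e9', '\u20ac', '\U0001D401'):
    raw = ch.encode('utf-8')
    print(hex(ord(ch)), raw, utf8_sequence_length(raw[0]), len(raw))
```

The proposed scheme with flags 0, 1, 00, 01, 10, 11 cannot work because a decoder reading a leading 0 bit could not tell whether it has seen the complete flag "0" or the first half of "00" or "01". The real prefixes (0, 110, 1110, 11110, and 10 for continuations) never start with one another, so each byte can be classified on its own.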
Re: A few questiosn about encoding
(*) infact UTF8 also indicates the end of each character Up to a point. The initial byte encodes the length and the top few bits, but the subsequent octets aren’t distinguishable as final in isolation. 0x80-0xBF can all be either medial or final. So, the first high-bits are a directive that UTF-8 uses to know how many bytes each character is being represented as. 0-127 codepoints(characters) use 1 bit to signify they need 1 bit for storage and the rest 7 bits to actually store the character ? while 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for storage and the rest 14 bits to actually store the character ? Isn't 14 bits way to many to store a character ? -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Wed, 12 Jun 2013 09:09:05 +, Νικόλαος Κούρας wrote: Isn't 14 bits way to many to store a character ? No. There are 1114111 possible characters in Unicode. (And in Japan, they sometimes use TRON instead of Unicode, which has even more.) If you list out all the combinations of 14 bits: 00 01 10 11 [...] 10 11 you will see that there are only 32767 (2**15-1) such values. You can't fit 1114111 characters with just 32767 values. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 12/6/2013 12:24 μμ, Steven D'Aprano wrote: On Wed, 12 Jun 2013 09:09:05 +, Νικόλαος Κούρας wrote: Isn't 14 bits way to many to store a character ? No. There are 1114111 possible characters in Unicode. (And in Japan, they sometimes use TRON instead of Unicode, which has even more.) If you list out all the combinations of 14 bits: 00 01 10 11 [...] 10 11 you will see that there are only 32767 (2**15-1) such values. You can't fit 1114111 characters with just 32767 values. Thanks Steven, So, how many bytes does UTF-8 stored for codepoints 127 ? example for codepoint 256, 1345, 16474 ? -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On 06/12/2013 05:24 AM, Steven D'Aprano wrote: On Wed, 12 Jun 2013 09:09:05 +, Νικόλαος Κούρας wrote: Isn't 14 bits way too many to store a character ? No. There are 1114111 possible characters in Unicode. (And in Japan, they sometimes use TRON instead of Unicode, which has even more.) If you list out all the combinations of 14 bits: 00 01 10 11 [...] 10 11 you will see that there are only 32767 (2**15-1) such values. You can't fit 1114111 characters with just 32767 values. Actually, it's worse. There are 16384 such values (2**14), assuming you include null, which you did in your list. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
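The arithmetic in this exchange is easy to check at the prompt. A short sketch, with variable names chosen only for this example:

```python
# Number of distinct values that fit in 14 bits (including zero).
values_in_14_bits = 2 ** 14

# Unicode code points run from U+0000 to U+10FFFF inclusive.
highest_code_point = 0x10FFFF
code_point_count = highest_code_point + 1

# Bits needed to hold the highest code point.
bits_needed = highest_code_point.bit_length()

print(values_in_14_bits, highest_code_point, code_point_count, bits_needed)
```

So 1114111 is the highest code point rather than the count of code points, and 21 bits are needed to number them all.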
Re: A few questiosn about encoding
Am 12.06.2013 13:23, schrieb Νικόλαος Κούρας: So, how many bytes does UTF-8 stored for codepoints 127 ? What has your research turned up? I personally consider it lazy and respectless to get lots of pointers that you could use for further research and ask for more info before you even followed these links. example for codepoint 256, 1345, 16474 ? Yes, examples exist. Gee, if there only was an information network that you could access and where you could locate information on various programming-related topics somehow. Seriously, someone should invent this thing! But still, even without it, you have all the tools (i.e. Python) in your hand to generate these examples yourself! Check out ord, bin, encode, decode for a start. Uli -- http://mail.python.org/mailman/listinfo/python-list
Re: A few questiosn about encoding
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote: So, how many bytes does UTF-8 store for codepoints > 127 ?

U+0000..U+007F     1 byte
U+0080..U+07FF     2 bytes
U+0800..U+FFFF     3 bytes
>= U+10000         4 bytes

So, 1 byte for ASCII, 2 bytes for other Latin characters, Greek, Cyrillic, Arabic, and Hebrew, 3 bytes for Chinese/Japanese/Korean, 4 bytes for dead languages and mathematical symbols. The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits). -- http://mail.python.org/mailman/listinfo/python-list
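The range table in this message can be written as a small function and compared against what Python's own encoder produces. A sketch only; utf8_length is a name invented for this example.

```python
def utf8_length(code_point):
    """Number of bytes modern UTF-8 uses for a given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Check the boundaries of each range against the real encoder.
# (Surrogates U+D800..U+DFFF are skipped: they cannot be encoded.)
boundaries = [0x0, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    print(hex(cp), utf8_length(cp), len(chr(cp).encode('utf-8')))
```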
Re: A few questiosn about encoding
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote: So, how many bytes does UTF-8 store for codepoints > 127 ? Two, three or four, depending on the codepoint. example for codepoint 256, 1345, 16474 ? You can do this yourself. I have already given you enough information in previous emails to answer this question on your own, but here it is again: Open an interactive Python session, and run this code:

c = chr(16474)
len(c.encode('utf-8'))

That will tell you how many bytes are used for that example. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
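The same check can be run for all three code points asked about. A minimal sketch, assuming Python 3, where chr() turns an integer code point into a one-character string:

```python
# Map each code point from the question to its UTF-8 byte count.
lengths = {}
for code_point in (256, 1345, 16474):
    encoded = chr(code_point).encode('utf-8')
    lengths[code_point] = len(encoded)
    print(code_point, hex(code_point), encoded, len(encoded))
```

Code points 256 and 1345 are below U+0800 and take two bytes; 16474 (U+405A) is above it and takes three.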
Re: A few questiosn about encoding
On Wed, 12 Jun 2013 21:30:23 +0100, Nobody wrote: The mechanism used by UTF-8 allows sequences of up to 6 bytes, for a total of 31 bits, but UTF-16 is limited to U+10FFFF (slightly more than 20 bits). Same with UTF-8 and UTF-32, both of which are limited to U+10FFFF because that is what Unicode is limited to. The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but that's not UTF-8, that's UTF-8-plus-extra-codepoints. Likewise the mechanism of UTF-32 could go up to 0xFFFFFFFF, but doing so means you don't have Unicode chars any more, and hence your byte-string is not valid UTF-32:

py> b = b'\xFF'*8
py> b.decode('UTF-32')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

-- Steven -- http://mail.python.org/mailman/listinfo/python-list
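To see exactly where the limit sits, one can pack a 32-bit value by hand and try to decode it. A sketch under the assumption that the big-endian codec 'utf-32-be' is used, so that no byte-order mark is involved; the helper name decodes_as_utf32 is made up for this example.

```python
import struct


def decodes_as_utf32(value):
    """Return True if the 32-bit value is accepted as a UTF-32 code unit."""
    raw = struct.pack('>I', value)      # 4 bytes, big-endian
    try:
        raw.decode('utf-32-be')
    except UnicodeDecodeError:
        return False
    return True


highest_ok = decodes_as_utf32(0x10FFFF)     # last Unicode code point
first_bad = decodes_as_utf32(0x110000)      # one past the end
all_ones = decodes_as_utf32(0xFFFFFFFF)     # what the 32-bit mechanism could hold

print(highest_ok, first_bad, all_ones)
```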
Re: A few questiosn about encoding
On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but that's not UTF-8, that's UTF-8-plus-extra-codepoints. And a proper UTF-8 decoder will reject \xC0\x80 and \xed\xa0\x80, even though mathematically they would translate into U+0000 and U+D800 respectively. The UTF-16 *mechanism* is limited to no more than Unicode has currently used, but I'm left wondering if that's actually the other way around - that Unicode planes were deemed to stop at the point where UTF-16 can't encode any more. Not that it matters; with most of the current planes completely unallocated, it seems unlikely we'll be needing more. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
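Python 3's strict UTF-8 decoder behaves as described here, which can be checked directly. A small sketch; try_decode is a helper invented for this example that returns None when decoding fails.

```python
def try_decode(raw):
    """Decode bytes as strict UTF-8, returning None if they are rejected."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None


overlong_nul = try_decode(b'\xC0\x80')        # overlong form of U+0000
lone_surrogate = try_decode(b'\xED\xA0\x80')  # would be the surrogate U+D800
proper_nul = try_decode(b'\x00')              # the one valid encoding of U+0000

print(overlong_nul, lone_surrogate, repr(proper_nul))
```

Rejecting the overlong form means every code point has exactly one valid UTF-8 encoding, so a byte-level search for a character cannot be dodged by an alternative spelling.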
Re: A few questiosn about encoding
On 9 Jun 2013 11:49, Νικόλαος Κούρας nikos.gr...@gmail.com wrote: A few questions about encoding please: Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA? I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. UTF-8 and UTF-16 and UTF-32 I thought the number beside of UTF- was to declare how many bits the character set was using to store a character into the hdd, no? Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. A surrogate pair is like hitting for example Ctrl-A, which means is a combination character that consists of 2 different characters? Is this what a surrogate is? a pair of 2 chars? UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code-point. 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 ) The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? -- http://mail.python.org/mailman/listinfo/python-list In short, a utf-8 character takes 1 to 4 bytes. A utf-16 character takes 2 or 4 bytes. A utf-32 character always takes 4 bytes. The process of converting characters to bytes is called encoding. The opposite, converting bytes back to characters, is decoding. This is all made transparent in python with the encode() and decode() methods. You normally don't care about this kind of thing. -- http://mail.python.org/mailman/listinfo/python-list
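The byte counts in the summary above can be confirmed with encode() and decode(). A sketch using four sample characters chosen for this example; the '-le' codec variants are used so that no byte-order mark is added to the output.

```python
samples = {
    'a': 'a',                    # U+0061, ASCII
    'alpha': '\u03b1',           # U+03B1, Greek small alpha
    'cjk': '\u4e2d',             # U+4E2D, a common CJK ideograph
    'bold_b': '\U0001D401',      # U+1D401, outside the BMP
}

# For each character: (UTF-8 bytes, UTF-16 bytes, UTF-32 bytes).
sizes = {}
for name, ch in samples.items():
    sizes[name] = (
        len(ch.encode('utf-8')),
        len(ch.encode('utf-16-le')),
        len(ch.encode('utf-32-le')),
    )
    # encode() goes from str to bytes; decode() goes back again.
    assert ch.encode('utf-8').decode('utf-8') == ch

print(sizes)
```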
Re: A few questiosn about encoding
On Sun, 09 Jun 2013 03:44:57 -0700, Νικόλαος Κούρας wrote: Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA? I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. But then you've used up all 256 possible bytes for storing the first 256 characters, and there aren't any left for use in multi-byte sequences. You need some means to distinguish between a single-byte character and an individual byte within a multi-byte sequence. UTF-8 does that by allocating specific ranges to specific purposes. 0x00-0x7F are single-byte characters, 0x80-0xBF are continuation bytes of multi-byte sequences, 0xC0-0xFF are leading bytes of multi-byte sequences. This scheme has the advantage of making UTF-8 non-modal, i.e. if a byte is corrupted, added or removed, it will only affect the character containing that particular byte; the decoder can re-synchronise at the beginning of the following character. OTOH, with encodings such as UTF-16, UTF-32 or ISO-2022, adding or removing a byte will result in desynchronisation, with all subsequent characters being corrupted. A surrogate pair is like hitting for example Ctrl-A, which means is a combination character that consists of 2 different characters? Is this what a surrogate is? a pair of 2 chars? A surrogate pair is a pair of 16-bit codes used to represent a single Unicode character whose code is greater than 0xFFFF. The 2048 codepoints from 0xD800 to 0xDFFF inclusive aren't used to represent characters, but surrogates. Unicode characters with codes in the range 0x10000-0x10FFFF are represented in UTF-16 as a pair of surrogates. First, 0x10000 is subtracted from the code, giving a value in the range 0-0xFFFFF (20 bits). 
The top ten bits are added to 0xD800 to give a value in the range 0xD800-0xDBFF, while the bottom ten bits are added to 0xDC00 to give a value in the range 0xDC00-0xDFFF. Because the codes used for surrogates aren't valid as individual characters, scanning a string for a particular character won't accidentally match part of a multi-word character. 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 ) 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 ) Most Chinese, Japanese and Korean (CJK) characters have codepoints within the BMP (i.e. <= 0xFFFF), so they only require 3 bytes in UTF-8. The codepoints above the BMP are mostly for archaic ideographs (those no longer in normal use), mathematical symbols, dead languages, etc. The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table? Yes. UTF-8 is essentially a mechanism for representing 31-bit unsigned integers such that smaller integers require fewer bytes than larger integers (subsequent revisions of Unicode cap the range of possible codepoints to 0x10FFFF, as that's all that UTF-16 can handle). -- http://mail.python.org/mailman/listinfo/python-list
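The surrogate arithmetic described in this message (subtract 0x10000, then split the 20-bit result into two 10-bit halves) can be sketched in a few lines. The function name to_surrogate_pair is made up for this example; the result is compared against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000          # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)         # top ten bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom ten bits
    return high, low


high, low = to_surrogate_pair(0x1D401)     # MATHEMATICAL BOLD CAPITAL B
by_hand = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
by_codec = chr(0x1D401).encode('utf-16-be')

print(hex(high), hex(low), by_hand == by_codec)
```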
Re: A few questiosn about encoding
On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας nikos.gr...@gmail.com wrote: A few questions about encoding please: Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for values up to 256? Because then how do you tell when you need one byte, and when you need two? If you read two bytes, and see 0x4C 0xFA, does that mean two characters, with ordinal values 0x4C and 0xFA, or one character with ordinal value 0x4CFA? I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256. It is required so the computer can know where characters begin. 0x0080 (first non-ASCII character) becomes 0xC280 in UTF-8. Further details here: http://en.wikipedia.org/wiki/UTF-8#Description UTF-8 and UTF-16 and UTF-32 I thought the number beside of UTF- was to declare how many bits the character set was using to store a character into the hdd, no? Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values. UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit values to make a surrogate pair. A surrogate pair is like hitting for example Ctrl-A, which means is a combination character that consists of 2 different characters? Is this what a surrogate is? a pair of 2 chars? http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B1_to_U.2B10 Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit numbers → 0xD800 + first_half 0xDC00 + second_half. Rephrasing: We take MATHEMATICAL BOLD CAPITAL B (U+1D401). If you have UTF-8: 𝐁 It is over 0xFFFF, and we need to use surrogate pairs. Subtracting 0x10000, we end up with 0xD401, or 0b1101010000000001. Both representations are worthless, as we have a 16-bit number, not a 20-bit one. We throw in some leading zeroes and end up with 0b00001101010000000001. Split it in half and we get 0b0000110101 and 0b0000000001, which we can now shorten to 0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001. 0xD800 + 0x0035 and 0xDC00 + 0x0001 → 0xD835 0xDC01. 
Type it into python and:

b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'

And before you ask: that “BE” stands for Big-Endian. Little-Endian would mean reversing the two bytes within each 16-bit code unit, which would make it b'\x35\xD8\x01\xDC' (the low-order byte comes first, so 'a', which is U+0061, is stored as the bytes 0x61 0x00 in a little-endian encoding). Another question you may ask: 0xD800…0xDFFF are reserved in Unicode for the purposes of UTF-16, so there are no conflicts. UTF-8 uses 8-bit values, but sometimes it combines two, three or four of them to represent a single code-point. 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65) 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 ) yup. α is at 0x03B1, or 945 decimal. 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 ) Not necessarily, as CJK characters start at U+2E80, which is in the 3-byte range (0x0800 through 0xFFFF); the table is here: http://en.wikipedia.org/wiki/UTF-8#Description -- Kwpolska http://kwpolska.tk | GPG KEY: 5EAAEA16 stop html mail| always bottom-post http://asciiribbon.org| http://caliburn.nl/topposting.html -- http://mail.python.org/mailman/listinfo/python-list
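The byte-order point can be checked by encoding the same character both ways. A short sketch, assuming Python 3 and the codec names 'utf-16-be' and 'utf-16-le', which fix the byte order and add no byte-order mark:

```python
bold_b = '\U0001D401'   # MATHEMATICAL BOLD CAPITAL B

big_endian = bold_b.encode('utf-16-be')      # high-order byte of each unit first
little_endian = bold_b.encode('utf-16-le')   # low-order byte of each unit first

# A BMP character for comparison: 'a' is U+0061.
a_big = 'a'.encode('utf-16-be')
a_little = 'a'.encode('utf-16-le')

# Decoding with the matching byte order gives the character back.
round_trip = big_endian.decode('utf-16-be')

print(big_endian, little_endian, a_big, a_little, round_trip == bold_b)
```

Note that only the bytes inside each 16-bit unit swap places; the order of the two surrogates themselves stays the same.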