Re: Newbie question about text encoding

2015-03-09 Thread Rustom Mody
On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote: Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that

Re: Newbie question about text encoding

2015-03-09 Thread Marko Rauhamaa
Ben Finney ben+pyt...@benfinney.id.au: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Then we're back to square one:

Re: Newbie question about text encoding

2015-03-09 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that

Re: Newbie question about text encoding

2015-03-09 Thread Steven D'Aprano
Chris Angelico wrote: As to the notion of rejecting the construction of strings containing these invalid codepoints, I'm not sure. Are there any languages out there that have a Unicode string type that requires that all codepoints be valid (no surrogates, no U+FFFE, etc)? U+FFFE and U+

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. But it is a valid

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical question. It's a genuine question. As

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Rustom Mody wrote: On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have?

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. But it is a valid non-character code point. It is quite correct to throw this error. '\udd00' is a

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place. That's not a rhetorical

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Steven D'Aprano wrote: Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy:

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. It is quite correct to throw this error. '\udd00' is a valid str object: '\udd00' '\udd00' '\udd00'.encode('utf-32')
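
For readers following along, here is a minimal sketch of the behaviour being argued about, assuming CPython 3. A lone surrogate is accepted as a str, while the strict UTF-8 codec refuses to encode it. The `surrogatepass` handler at the end is a standard codec error handler; it is not mentioned in the snippet above and is shown only for contrast.

```python
# A lone surrogate is accepted as a str literal in Python 3.
s = '\udd00'
assert len(s) == 1 and ord(s) == 0xDD00

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    s.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes the surrogate code point like any other
# three-byte code point instead of raising.
passed = s.encode('utf-8', errors='surrogatepass')
```

Whether constructing such a string should itself be an error is exactly the open question in the thread; the sketch only shows what the interpreter currently does.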

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa ma...@pacujo.net wrote: Chris Angelico ros...@gmail.com: Once again, you appear to be surprised that invalid data is failing. Why is this so strange? U+DD00 is not a valid character. It is quite correct to throw this error. '\udd00' is a valid

Re: Newbie question about text encoding

2015-03-08 Thread Rustom Mody
On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote: On 07Mar2015 22:09, Steven D'Aprano wrote: Rustom Mody wrote: [...big snip...] Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with

Re: Newbie question about text encoding

2015-03-08 Thread Ben Finney
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error type be a ValueError, though? The statement is not, to my

Re: Newbie question about text encoding

2015-03-08 Thread Cameron Simpson
On 07Mar2015 22:09, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Rustom Mody wrote: [...big snip...] Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney ben+pyt...@benfinney.id.au wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error.

Re: Newbie question about text encoding

2015-03-08 Thread random832
On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: '\udd00' is a valid str object: Is it though? Perhaps the bug is not UTF-8's inability to encode lone surrogates, but that Python allows you to create lone surrogates in the first place.

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: For those cases where you do wish to take an arbitrary byte stream and round-trip it, Python now provides an error handler for that. py import random py b = bytes([random.randint(0, 255) for _ in range(1)]) py s = b.decode('utf-8')
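
The snippet above is cut off before the error handler is named. Assuming it refers to `surrogateescape` (PEP 383), the round trip can be sketched as follows; the helper name `round_trip` is made up for illustration.

```python
def round_trip(raw: bytes) -> bytes:
    """Decode arbitrary bytes to str and encode them back unchanged."""
    # Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF.
    text = raw.decode('utf-8', errors='surrogateescape')
    # Encoding with the same handler turns those surrogates back into bytes.
    return text.encode('utf-8', errors='surrogateescape')


# Every possible byte value, most of which do not form valid UTF-8 here.
all_bytes = bytes(range(256))
escaped = b'\x80'.decode('utf-8', errors='surrogateescape')
```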

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Literal/Legalistic

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa ma...@pacujo.net wrote: * it still isn't bijective between str and bytes: '\udd00'.encode('utf-8', errors='surrogateescape') Traceback (most recent call last): File stdin, line 1, in module UnicodeEncodeError: 'utf-8' codec can't
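
A short sketch of why this particular call fails, assuming CPython 3: `surrogateescape` only maps the surrogates it can itself produce (U+DC80 to U+DCFF) back to bytes, and U+DD00 lies outside that range. The helper name `escape_encode` is hypothetical.

```python
def escape_encode(text: str) -> bytes:
    """Encode to UTF-8, mapping escaped surrogates back to raw bytes."""
    return text.encode('utf-8', errors='surrogateescape')


# U+DC80 is inside the handler's range, so it becomes the single byte 0x80.
low = escape_encode('\udc80')

# U+DD00 is outside the range, so the handler cannot help and encoding fails.
try:
    escape_encode('\udd00')
    outside_ok = True
except UnicodeEncodeError:
    outside_ok = False
```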

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Marko --
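
The property quoted above can be checked directly. This is only a sketch; the function name `utf8_round_trips` is invented here, and treating a decoding failure as "does not satisfy" is one reading of the claim.

```python
def utf8_round_trips(b: bytes) -> bool:
    """Return True if b.decode('utf-8').encode('utf-8') == b holds."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # b is not a valid UTF-8 stream, so the expression cannot be evaluated.
        return False


valid = 'caf\u00e9'.encode('utf-8')   # well-formed UTF-8
stray = b'\x80'                        # continuation byte with no lead byte
overlong = b'\xc0\x80'                 # overlong form, rejected by the codec
```

Every bytes object that decodes at all does round-trip; the objects that fail are the ones that are not UTF-8 in the first place, which is the point both sides of the exchange are circling.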

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote: Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 16:25, Marko Rauhamaa wrote: Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: There are two things happening here: 1) The underlying file system is not UTF-8, and you can't depend on that, Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers d...@tombstonezero.net wrote: I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after*

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 16:48, Marko Rauhamaa wrote: Mark Lawrence breamore...@yahoo.co.uk: On 07/03/2015 16:25, Marko Rauhamaa wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Dan Sommers d...@tombstonezero.net: I think we're all agreeing: not all file systems are the same, and Python doesn't smooth out all of the bumps, even for something that seems as simple as displaying the names of files in a directory. And that's *after* we've agreed that filesystems contain

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence breamore...@yahoo.co.uk wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. He was talking about this line of

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Mark Lawrence breamore...@yahoo.co.uk: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko -- https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: See: $ mkdir /tmp/xyz $ touch /tmp/xyz/ \x80' $ python3 Python 3.3.2 (default, Dec 4 2014, 12:49:00) [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux Type help, copyright, credits or license for more

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 17:16, Marko Rauhamaa wrote: Mark Lawrence breamore...@yahoo.co.uk: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko You've been a PITA ever since you first joined this list, what

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote: You can't operate on file names and text files using Python strings. Or at least, you will need to add (nontrivial) exception catching logic. You can't operate on a JPG file using a Unicode string, nor an array of integers.

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com: If you really REALLY can't use the bytes() type to work with something that is, yaknow, bytes, then you could use an alternative encoding that has a value for every byte. It's still not Unicode text, so it doesn't much matter which encoding you use. But it's

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux developers should refer to filenames using bytes. Unfortunately, Python
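
As a hedged illustration of referring to filenames as bytes in Python 3: passing a bytes path to `os.listdir` returns bytes names, so no decoding is attempted. The helper name `list_names_as_bytes` and the file name `plain.txt` are made up for this sketch.

```python
import os
import tempfile


def list_names_as_bytes(directory: str) -> list:
    """List directory entries as raw bytes names, in sorted order."""
    # os.fsencode converts the str path to bytes using the filesystem
    # encoding; a bytes argument makes os.listdir return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file so there is something to list.
    open(os.path.join(tmp, 'plain.txt'), 'w').close()
    names = list_names_as_bytes(tmp)

# os.fsdecode goes the other way when a str is needed for display.
shown = os.fsdecode(names[0])
```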

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 18:34, Dan Sommers wrote: On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Mark Lawrence breamore...@yahoo.co.uk: On 07/03/2015 16:25, Marko Rauhamaa wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. Python 3.3.2 (default, Dec

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote: All you've proven is that there are bit patterns which are not UTF-8 streams... And that causes problems. Demonstrate. ChrisA -- https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: File names encoded with Latin-X are quite commonplace even in UTF-8 locales. That is not a problem with UTF-8, though. I don't understand how you're blaming UTF-8 for that. I'm saying it

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote: Correct. Linux pathnames are octet strings regardless of the locale. That's why Linux

Re: Newbie question about text encoding

2015-03-07 Thread Albert-Jan Roskam
--- Original Message - From: Chris Angelico ros...@gmail.com To: Cc: python-list@python.org python-list@python.org Sent: Saturday, March 7, 2015 6:26 PM Subject: Re: Newbie question about text encoding On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote: See

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sat, 07 Mar 2015 19:00:47 +, Mark Lawrence wrote: Isn't pathlib https://docs.python.org/3/library/pathlib.html#module-pathlib effectively a more recent attempt at smoothing or even removing (some of) the bumps? Has anybody here got experience of it as I've never used it? I almost

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com: On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Stop using MySQL, which is a joke of a database[1], and use Postgres which does not have this problem. I agree with the recommendation, though to be fair to MySQL, it is now possible to store full

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Rustom Mody wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 12:02, Chris Angelico wrote: On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote: The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up.

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 11:09, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Methinks somebody has been drinking too much loony

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? As far as I am aware, every code point has one and only one valid UTF-8 encoding, and every UTF-8 encoding has one and only one valid code point. There are *invalid* UTF-8

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Rustom Mody wrote: On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: [...] Chris is suggesting that going from BMP to all of Unicode is not the hard part. Going from ASCII to the BMP part of Unicode is the hard part. If you can do that, you can go the rest of the way

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote: The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up. So... use Pike, or Python 3.3+? ChrisA --

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote: On 3/6/2015 11:20 AM, Rustom Mody wrote: = pp =  print (pp) = Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File /usr/bin/idle3,

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote: On 07/03/2015 17:16, Marko Rauhamaa wrote: Mark Lawrence: It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Marko Rauhamaa wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info: Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes objects b that don't satisfy:

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 1:03 AM, random...@fastmail.us wrote: On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: Number of code points is the most logical way to length-limit something. If you want to allow users to set their display names but not to make arbitrarily long ones, limiting them

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote: To prevent people from putting three paragraphs of lipsum in and calling it a username. Limiting by UTF-8 bytes or UTF-16 units works just as well for that. So you truncate to the desired length, then if the first character of the

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: Number of code points is the most logical way to length-limit something. If you want to allow users to set their display names but not to make arbitrarily long ones, limiting them to X code points is the safest way (and preferably do an NFC
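
A minimal sketch of the limit being described, assuming the truncated "NFC" above refers to NFC normalization before counting. The function name `fits_limit` and its parameters are hypothetical.

```python
import unicodedata


def fits_limit(name: str, max_code_points: int) -> bool:
    """Check a display name against a code point limit after NFC."""
    # NFC composes sequences such as 'e' + U+0301 into a single code point
    # where a precomposed form exists, so equivalent names count the same.
    normalized = unicodedata.normalize('NFC', name)
    return len(normalized) <= max_code_points


decomposed = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT
```

Note that this counts code points, not what a user would perceive as characters; sequences with no precomposed form still count as several code points.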

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano
Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time somebody cried Strawman! when what they really should say is Yes, that's a good argument, I'm afraid

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody rustompm...@gmail.com wrote: Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode consortium for allocating something at codepoint 0, or

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote: Also: Can a programmer who is away from UTF-16 in one part of the system (say by using python3) assume he is safe all over? The most common failure of UTF-16 support, supposedly, is in programs misusing the number of code units (for length or
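
To make the distinction between code units and code points concrete, here is a small sketch in Python 3, where `len()` on a str counts code points. The function name `counts` is invented for illustration.

```python
def counts(text: str) -> dict:
    """Return the length of text measured three different ways."""
    return {
        'code_points': len(text),
        # Each UTF-16 code unit is two bytes; 'utf-16-le' adds no BOM.
        'utf16_units': len(text.encode('utf-16-le')) // 2,
        'utf8_bytes': len(text.encode('utf-8')),
    }


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16 needs
# a surrogate pair for it.
clef = '\U0001D11E'
```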

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 12:33 AM, random...@fastmail.us wrote: However, when do you _really_ want the number of characters? You may want to use it for, for example, the number of columns in a 'monospace' font, which you've already screwed up because you haven't accounted for double-wide

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano
random...@fastmail.us wrote: My point is there are very few problems to which count of Unicode code points is the only right answer - that UTF-32 is good enough for but that are meaningfully impacted by a naive usage of UTF-16, to the point where UTF-16 is something you have to be safe from.

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] Strawman. Sigh. If I had a dollar for every time somebody cried

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody rustompm...@gmail.com wrote: C's string is not bug-prone, it's plain buggy, as it cannot represent strings with nulls. I would not go that far for UTF-16. It is bug-inviting but it can also be implemented correctly. C's standard library string handling

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote: Let's please stick to UTF-16 shall we? Now tell me: - Is it broken or not? - Is it widely used or not? - Should programmers be careful of it or not? - Should programmers be warned about it or not? Also: Can a programmer

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote: On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote: Broken systems can be shown up by anything. Suppose you have a program that breaks when it gets a NUL character (not unknown in C code); is the fault with the Unicode

Re: Newbie question about text encoding

2015-03-06 Thread Terry Reedy
On 3/6/2015 11:20 AM, Rustom Mody wrote: = pp =  print (pp) = Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File /usr/bin/idle3, line 5, in module main() File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in

Re: Newbie question about text encoding

2015-03-05 Thread random832
On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise don't exist. UTF-8 is already

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano
random...@fastmail.us wrote: On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8 and UTF-32, since that goes against the grain of the system. You would have to program in artificial restrictions that otherwise

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano
Rustom Mody wrote: On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the

Re: Newbie question about text encoding

2015-03-05 Thread Chris Angelico
On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody rustompm...@gmail.com wrote: My conclusion: Early adopters of unicode -- Windows and Java -- were punished for their early adoption. You can blame the unicode consortium, you can blame the babel of human languages, particularly that some use

Re: Newbie question about text encoding

2015-03-05 Thread Rustom Mody
On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody rustompm...@gmail.com wrote: What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish'] Re footnote #4: ½ is a single character for

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII: http://blog.languager.org/2015/02/universal-unicode.html I

Re: Newbie question about text encoding

2015-03-03 Thread Terry Reedy
On 3/3/2015 1:03 PM, Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the two-way

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote: On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote: What I was trying to say expanded here http://blog.languager.org/2015/03/whimsical-unicode.html [Hope the word 'whimsical' is less jarring and more accurate than

Re: Newbie question about text encoding

2015-03-03 Thread Steven D'Aprano
Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up on why we should stop using ASCII:

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote: On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote:

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 12:07:06 AM UTC+5:30, jmf wrote: Le mardi 3 mars 2015 19:04:06 UTC+1, Rustom Mody a écrit : On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: On 2/26/2015 8:24 AM, Chris Angelico wrote: On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: Wrote something up

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: It is easy to mock what is not important to you. I daresay kids adding emoji to their 10 character tweets would mock all the useless maths symbols in Unicode too. Definitely! Who ever sings do you wanna

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody rustompm...@gmail.com wrote: It lists some examples of software that somehow break/goof going from BMP-only unicode to 7.0 unicode. IOW the suggestion is that the two-way classification - ASCII - Unicode is less useful and accurate than the

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel
On 02/27/2015 06:54 AM, Steven D'Aprano wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because you can't fake what you ain't got.) If I recall correctly, disk

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 3:45 AM, alister alister.nospam.w...@ntlworld.com wrote: On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards
On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because you can't fake what you ain't got.)

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote: The term virtual memory is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of running in a swapfile as opposed to running in physical RAM. Given that this started with a

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards
On 2015-02-27, Grant Edwards invalid@invalid.invalid wrote: On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: Dave Angel wrote: On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: (Although I believe Seymour Cray was quoted as saying that virtual memory

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are huge. But suppose you need to run a

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are required could speed up the start time of many large applications.

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel
On 02/27/2015 09:22 AM, Chris Angelico wrote: On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote: The term virtual memory is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of running in a swapfile as opposed to running in

Re: Newbie question about text encoding

2015-02-27 Thread MRAB
On 2015-02-27 16:45, alister wrote: On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister alister.nospam.w...@ntlworld.com wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel
On 02/27/2015 11:00 AM, alister wrote: On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: If you're trying to use the pagefile/swapfile as if it's more memory (I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!), then yes, these performance considerations are

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 7:52 AM, Dave Angel da...@davea.name wrote: If that's the case on the architectures you're talking about, then the problem of slow loading is not triggered by the memory usage, but by lots of initialization code. THAT's what should be deferred for seldom-used portions

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Fri, 27 Feb 2015 19:14:00 +, MRAB wrote: I suppose you could load the basic parts first so that the user can start working, and then load the additional features in the background. quite possible my opinion on this is very fluid it may work for some applications, it probably wouldn't
