Re: Newbie question about text encoding

2015-03-09 Thread Rustom Mody
On Monday, March 9, 2015 at 12:05:05 PM UTC+5:30, Steven D'Aprano wrote: > Chris Angelico wrote: > > > As to the notion of rejecting the construction of strings containing > > these invalid codepoints, I'm not sure. Are there any languages out > > there that have a Unicode string type that require

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:34 PM, Steven D'Aprano wrote: > Chris Angelico wrote: > >> As to the notion of rejecting the construction of strings containing >> these invalid codepoints, I'm not sure. Are there any languages out >> there that have a Unicode string type that requires that all >> codepoi

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Chris Angelico wrote: > As to the notion of rejecting the construction of strings containing > these invalid codepoints, I'm not sure. Are there any languages out > there that have a Unicode string type that requires that all > codepoints be valid (no surrogates, no U+FFFE, etc)? U+FFFE and U+FFF

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa
Ben Finney : > Steven D'Aprano writes: > >> '\udd00' should be a SyntaxError. > > I find your argument convincing, that attempting to construct a > Unicode string of a lone surrogate should be an error. Then we're back to square one: >>> b'\x80'.decode('utf-8', errors='surrogateescape') '

Re: Newbie question about text encoding

2015-03-08 Thread random832
On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote: > Steven D'Aprano writes: > > > '\udd00' should be a SyntaxError. > > I find your argument convincing, that attempting to construct a Unicode > string of a lone surrogate should be an error. > > Shouldn't the error type be a ValueError, though? T

Re: Newbie question about text encoding

2015-03-08 Thread Rustom Mody
On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote: > On 07Mar2015 22:09, Steven D'Aprano wrote: > >Rustom Mody wrote: > >>[...big snip...] > >> Some parts are here some earlier and from my memory. > >> If details wrong please correct: > >> - 200 million records > >> - Containi

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney wrote: > Steven D'Aprano writes: > >> '\udd00' should be a SyntaxError. > > I find your argument convincing, that attempting to construct a Unicode > string of a lone surrogate should be an error. > > Shouldn't the error type be a ValueError, though? The

Re: Newbie question about text encoding

2015-03-08 Thread Ben Finney
Steven D'Aprano writes: > '\udd00' should be a SyntaxError. I find your argument convincing, that attempting to construct a Unicode string of a lone surrogate should be an error. Shouldn't the error type be a ValueError, though? The statement is not, to my mind, erroneous syntax. -- \ “P

Re: Newbie question about text encoding

2015-03-08 Thread Cameron Simpson
On 07Mar2015 22:09, Steven D'Aprano wrote: Rustom Mody wrote: [...big snip...] Some parts are here some earlier and from my memory. If details wrong please correct: - 200 million records - Containing 4 strings with SMP characters - System made with python and mysql. SMP works with python, brea

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Marko Rauhamaa wrote: > Steven D'Aprano : > >> Marko Rauhamaa wrote: >>> '\udd00' is a valid str object: >> >> Is it though? Perhaps the bug is not UTF-8's inability to encode lone >> surrogates, but that Python allows you to create lone surrogates in >> the first place. That's not a rhetorical q

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano wrote: > Perhaps the bug is not UTF-8's inability to encode lone > surrogates, but that Python allows you to create lone surrogates in the > first place. That's not a rhetorical question. It's a genuine question. As to the notion of rejecting the co

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano wrote: > Marko Rauhamaa wrote: > >> Chris Angelico : >> >>> Once again, you appear to be surprised that invalid data is failing. >>> Why is this so strange? U+DD00 is not a valid character. > > But it is a valid non-character code point. > >>> It is

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa
Steven D'Aprano : > Marko Rauhamaa wrote: >> '\udd00' is a valid str object: > > Is it though? Perhaps the bug is not UTF-8's inability to encode lone > surrogates, but that Python allows you to create lone surrogates in > the first place. That's not a rhetorical question. It's a genuine > questio

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Rustom Mody wrote: > On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: >> Rustom Mody wrote: >> > This includes not just bug-prone-system code such as Java and Windows >> > but seemingly working code such as python 3. >> >> What Unicode bugs do you think Python 3.3 and abo

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Marko Rauhamaa wrote: > Chris Angelico : > >> Once again, you appear to be surprised that invalid data is failing. >> Why is this so strange? U+DD00 is not a valid character. But it is a valid non-character code point. >> It is quite correct to throw this error. > > '\udd00' is a valid str o

Re: Newbie question about text encoding

2015-03-08 Thread Chris Angelico
On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> Once again, you appear to be surprised that invalid data is failing. >> Why is this so strange? U+DD00 is not a valid character. It is quite >> correct to throw this error. > > '\udd00' is a valid str object: > >>>>

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa
Chris Angelico : > Once again, you appear to be surprised that invalid data is failing. > Why is this so strange? U+DD00 is not a valid character. It is quite > correct to throw this error. '\udd00' is a valid str object: >>> '\udd00' '\udd00' >>> '\udd00'.encode('utf-32') b'\xff\xfe

Re: Newbie question about text encoding

2015-03-08 Thread Steven D'Aprano
Steven D'Aprano wrote: > Marko Rauhamaa wrote: > >> Steven D'Aprano : >> >>> Marko Rauhamaa wrote: >>> That said, UTF-8 does suffer badly from its not being a bijective mapping. >>> >>> Can you explain? >> >> In Python terms, there are bytes objects b that don't satisfy: >> >>b.d

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa wrote: > * it still isn't bijective between str and bytes: > >>>> '\udd00'.encode('utf-8', errors='surrogateescape') >Traceback (most recent call last): > File "", line 1, in >UnicodeEncodeError: 'utf-8' codec can't encode character
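The behaviour quoted above can be reproduced with a short sketch, assuming CPython 3.3 or later. The helper name `try_encode` is made up for illustration, and the `surrogatepass` handler is not mentioned in the message; it is included only for contrast.

```python
# Sketch: how a lone surrogate such as U+DD00 behaves when encoded to UTF-8.
s = '\udd00'  # a str holding a single lone surrogate code point


def try_encode(text, errors):
    """Return the encoded bytes, or the exception class name on failure."""
    try:
        return text.encode('utf-8', errors=errors)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


# The strict handler rejects every surrogate.
strict_result = try_encode(s, 'strict')

# surrogateescape only maps U+DC80..U+DCFF back to bytes 0x80..0xFF,
# so U+DD00 still fails.
escape_result = try_encode(s, 'surrogateescape')

# surrogatepass encodes the surrogate as if it were an ordinary code point.
pass_result = try_encode(s, 'surrogatepass')

# A surrogate inside the escape range does encode under surrogateescape.
escaped_byte = try_encode('\udc80', 'surrogateescape')
```

So the lack of a str-to-bytes bijection here comes from surrogates outside the U+DC80..U+DCFF escape range, which `surrogateescape` leaves to fail.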

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote: > Rustom Mody wrote: > > This includes not just bug-prone-system code such as Java and Windows but > > seemingly working code such as python 3. > > What Unicode bugs do you think Python 3.3 and above have? Literal/Legalisti

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano : > For those cases where you do wish to take an arbitrary byte stream and > round-trip it, Python now provides an error handler for that. > > py> import random > py> b = bytes([random.randint(0, 255) for _ in range(1)]) > py> s = b.decode('utf-8') > Traceback (most recent call
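A minimal sketch of the round trip described above, assuming the error handler meant is `surrogateescape` (PEP 383). The byte count of 1000 and the fixed seed are my own choices for repeatability, not values from the message.

```python
import random

# Arbitrary bytes, almost certainly not valid UTF-8.
random.seed(0)  # fixed seed only so the sketch is repeatable
b = bytes(random.randint(0, 255) for _ in range(1000))

# Undecodable bytes become lone surrogates U+DC80..U+DCFF ...
s = b.decode('utf-8', errors='surrogateescape')

# ... and the same handler turns them back into the original bytes.
round_tripped = s.encode('utf-8', errors='surrogateescape')
```

The resulting str is only a carrier for the bytes; it will generally still fail if encoded with the default strict handler.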

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote: > On 3/6/2015 11:20 AM, Rustom Mody wrote: > > > = > > pp = "💩" > > print (pp) > > = > > Try open it in idle3 and you get (at least I get): > > > > $ idle3 ff.py > > Traceback (most recent call last): > >Fil

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody
On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote: > On 07/03/2015 17:16, Marko Rauhamaa wrote: > > Mark Lawrence: > > > >> It would clearly help if you were to type in the correct UK English > >> accent. > > > > Your ad-hominem-to-contribution ratio is alarmingly high. > > >

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Marko Rauhamaa wrote: > Steven D'Aprano : > >> Marko Rauhamaa wrote: >> >>> That said, UTF-8 does suffer badly from its not being >>> a bijective mapping. >> >> Can you explain? > > In Python terms, there are bytes objects b that don't satisfy: > >b.decode('utf-8').encode('utf-8') == b Are

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sat, 07 Mar 2015 19:00:47 +, Mark Lawrence wrote: > Isn't pathlib > https://docs.python.org/3/library/pathlib.html#module-pathlib > effectively a more recent attempt at smoothing or even removing (some > of) the bumps? Has anybody here got experience of it as I've never > used it? I almos

Re: Newbie question about text encoding

2015-03-07 Thread Albert-Jan Roskam
--- Original Message - > From: Chris Angelico > To: > Cc: "python-list@python.org" > Sent: Saturday, March 7, 2015 6:26 PM > Subject: Re: Newbie question about text encoding > > On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa wrote: >> See: >&

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Dan Sommers : > I think we're all agreeing: not all file systems are the same, and > Python doesn't smooth out all of the bumps, even for something that > seems as simple as displaying the names of files in a directory. And > that's *after* we've agreed that filesystems contain files in > hierarch

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 18:34, Dan Sommers wrote: On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers wrote: On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote: Correct. Linux pathnames a

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers wrote: > I think we're all agreeing: not all file systems are the same, and > Python doesn't smooth out all of the bumps, even for something that > seems as simple as displaying the names of files in a directory. And > that's *after* we've agreed that

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote: > On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers wrote: >> On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: >> >>> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote: >> Correct. Linux pathnames are octet strings regardless o

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 17:16, Marko Rauhamaa wrote: Mark Lawrence : It would clearly help if you were to type in the correct UK English accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko You've been a PITA ever since you first joined this list, what about it? -- My fellow Py

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers wrote: > On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: > >> On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote: > >>> Correct. Linux pathnames are octet strings regardless of the locale. >>> >>> That's why Linux developers should refer to

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers
On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote: > On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote: >> Correct. Linux pathnames are octet strings regardless of the locale. >> >> That's why Linux developers should refer to filenames using bytes. >> Unfortunately, Python itself viola

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa wrote: >> There are two things happening here: >> >> 1) The underlying file system is not UTF-8, and you can't depend on >> that, > > Correct. Linux pathnames are octet strings regardless of the locale. > > That's why Linux developers should refer to
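As a hedged illustration of the "pathnames are octet strings" point, the sketch below uses `os.fsdecode` and `os.fsencode`, assuming a POSIX system where Python applies `surrogateescape` to file names. The file name `caf\xe9` is a made-up example and nothing is written to disk.

```python
import os

# A Latin-1 encoded name; these bytes are not valid UTF-8.
raw_name = b'caf\xe9'

# On POSIX, undecodable bytes are smuggled through as lone surrogates.
as_text = os.fsdecode(raw_name)

# Encoding with the same filesystem settings restores the original bytes.
back_to_bytes = os.fsencode(as_text)
```

Passing a bytes path, for example `os.listdir(b'.')`, makes the `os` functions return bytes names and avoids decoding altogether.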

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico : > On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa wrote: >> File names encoded with Latin-X are quite commonplace even in UTF-8 >> locales. > > That is not a problem with UTF-8, though. I don't understand how > you're blaming UTF-8 for that. I'm saying it creates practical proble

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa wrote: > See: > >$ mkdir /tmp/xyz >$ touch /tmp/xyz/ > \x80' >$ python3 >Python 3.3.2 (default, Dec 4 2014, 12:49:00) >[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux >Type "help", "copyright", "credits" or "license" for more

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Mark Lawrence : > It would clearly help if you were to type in the correct UK English > accent. Your ad-hominem-to-contribution ratio is alarmingly high. Marko -- https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico : > If you really REALLY can't use the bytes() type to work with something > that is, yaknow, bytes, then you could use an alternative encoding > that has a value for every byte. It's still not Unicode text, so it > doesn't much matter which encoding you use. But it's much better to

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 16:48, Marko Rauhamaa wrote: Mark Lawrence : On 07/03/2015 16:25, Marko Rauhamaa wrote: Here's an example: b = b'\x80' Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping from str objects to bytes objects. Python 2 might, Python 3 doesn't. Python

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa wrote: > You can't operate on file names and text files using Python strings. Or > at least, you will need to add (nontrivial) exception catching logic. You can't operate on a JPG file using a Unicode string, nor an array of integers. What of it? You

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa wrote: >> All you've proven is that there are bit patterns which are not UTF-8 >> streams... > > And that causes problems. Demonstrate. ChrisA

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico : > On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa wrote: > Marko Rauhamaa wrote: >> That said, UTF-8 does suffer badly from its not being >> a bijective mapping. > >> Here's an example: >> >>b = b'\x80' >> >> Yes, it generates an exception. IOW, UTF-8 is not a

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence wrote: >> Here's an example: >> >> b = b'\x80' >> >> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping >> from str objects to bytes objects. >> > > Python 2 might, Python 3 doesn't. He was talking about this line of code: b.de

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Mark Lawrence : > On 07/03/2015 16:25, Marko Rauhamaa wrote: >> Here's an example: >> >> b = b'\x80' >> >> Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping >> from str objects to bytes objects. > > Python 2 might, Python 3 doesn't. Python 3.3.2 (default, Dec 4 2014, 1

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 16:25, Marko Rauhamaa wrote: Chris Angelico : On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa wrote: Steven D'Aprano : Marko Rauhamaa wrote: That said, UTF-8 does suffer badly from its not being a bijective mapping. Can you explain? In Python terms, there are bytes object

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa wrote: >>> Steven D'Aprano : >>> Marko Rauhamaa wrote: > That said, UTF-8 does suffer badly from its not being > a bijective mapping. Can you ex

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Chris Angelico : > On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa wrote: >> Steven D'Aprano : >> >>> Marko Rauhamaa wrote: >>> That said, UTF-8 does suffer badly from its not being a bijective mapping. >>> >>> Can you explain? >> >> In Python terms, there are bytes objects b that don't

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa wrote: > Steven D'Aprano : > >> Marko Rauhamaa wrote: >> >>> That said, UTF-8 does suffer badly from its not being >>> a bijective mapping. >> >> Can you explain? > > In Python terms, there are bytes objects b that don't satisfy: > >b.decode('utf-

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano : > Marko Rauhamaa wrote: > >> That said, UTF-8 does suffer badly from its not being >> a bijective mapping. > > Can you explain? In Python terms, there are bytes objects b that don't satisfy: b.decode('utf-8').encode('utf-8') == b Marko
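To make the claim concrete, here is a small sketch that checks every single-byte bytes object against that expression. The helper name `round_trips` is hypothetical, and a failed decode is counted as not satisfying the equality.

```python
def round_trips(b):
    """True if b survives decode-then-encode as UTF-8 unchanged."""
    try:
        return b.decode('utf-8').encode('utf-8') == b
    except UnicodeDecodeError:
        # The expression cannot even be evaluated for this input.
        return False


single_bytes = [bytes([i]) for i in range(256)]
ok = [b for b in single_bytes if round_trips(b)]
failing = [b for b in single_bytes if not round_trips(b)]
```

Only the ASCII range 0x00..0x7F passes; every single byte from 0x80 upward, including `b'\x80'`, is not a complete UTF-8 sequence on its own.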

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Marko Rauhamaa wrote: > That said, UTF-8 does suffer badly from its not being > a bijective mapping. Can you explain? As far as I am aware, every code point has one and only one valid UTF-8 encoding, and every UTF-8 encoding has one and only one valid code point. There are *invalid* UTF-8 encod
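A rough check of the one-to-one claim, assuming CPython's strict UTF-8 codec: every non-surrogate code point should survive an encode/decode round trip, and an overlong form of an already-encodable character should be rejected. The helper name `decodes` is made up for this sketch.

```python
def decodes(b):
    """True if b is accepted by the strict UTF-8 decoder."""
    try:
        b.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Every code point except the surrogates U+D800..U+DFFF round-trips.
all_round_trip = all(
    chr(cp).encode('utf-8').decode('utf-8') == chr(cp)
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)

shortest = '/'.encode('utf-8')  # the single valid encoding of U+002F
overlong = b'\xc0\xaf'          # two-byte overlong form of the same code point
overlong_accepted = decodes(overlong)
```

The mapping is therefore one-to-one between valid code points and valid byte sequences; it is just not defined on every possible byte string.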

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 11:09, Steven D'Aprano wrote: Rustom Mody wrote: This includes not just bug-prone-system code such as Java and Windows but seemingly working code such as python 3. What Unicode bugs do you think Python 3.3 and above have? Methinks somebody has been drinking too much loony ju

Re: Newbie question about text encoding

2015-03-07 Thread Mark Lawrence
On 07/03/2015 12:02, Chris Angelico wrote: On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa wrote: The main dream was a fixed-width encoding scheme. People thought 16 bits would be enough. The dream is so precious and true to us in the West that people don't want to give it up. So... use Pike,

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa wrote: > The main dream was a fixed-width encoding scheme. People thought 16 bits > would be enough. The dream is so precious and true to us in the West > that people don't want to give it up. So... use Pike, or Python 3.3+? ChrisA

Re: Newbie question about text encoding

2015-03-07 Thread Marko Rauhamaa
Steven D'Aprano : > Rustom Mody wrote: >> My conclusion: Early adopters of unicode -- Windows and Java -- were >> punished for their early adoption. You can blame the unicode >> consortium, you can blame the babel of human languages, particularly >> that some use characters and some only (the equi

Re: Newbie question about text encoding

2015-03-07 Thread Chris Angelico
On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano wrote: > Stop using MySQL, which is a joke of a database[1], and use Postgres which > does not have this problem. I agree with the recommendation, though to be fair to MySQL, it is now possible to store full Unicode. Though personally, I think the

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano
Rustom Mody wrote: > On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: [...] >> Chris is suggesting that going from BMP to all of Unicode is not the hard >> part. Going from ASCII to the BMP part of Unicode is the hard part. If >> you can do that, you can go the rest of the

Re: Newbie question about text encoding

2015-03-06 Thread Terry Reedy
On 3/6/2015 11:20 AM, Rustom Mody wrote: = pp = "💩" print (pp) = Try open it in idle3 and you get (at least I get): $ idle3 ff.py Traceback (most recent call last): File "/usr/bin/idle3", line 5, in main() File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in m

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote: > Rustom Mody wrote: > > > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: > > [snip example of an analogous situation with NULs] > > > Strawman. > > Sigh. If I had a dollar for every time somebody c

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano
random...@fastmail.us wrote: > My point is there are very few > problems to which "count of Unicode code points" is the only right > answer - that UTF-32 is good enough for but that are meaningfully > impacted by a naive usage of UTF-16, to the point where UTF-16 is > something you have to be "saf

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody wrote: > C's string is not bug-prone, it's plain buggy, as it cannot represent strings > with nulls. > > I would not go that far for UTF-16. > It is bug-inviting but it can also be implemented correctly. C's standard library string handling functions are re

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano wrote: > Rustom Mody wrote: > >> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: > > [snip example of an analogous situation with NULs] > >> Strawman. > > Sigh. If I had a dollar for every time somebody cried "Strawman!" when

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano
Rustom Mody wrote: > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: [snip example of an analogous situation with NULs] > Strawman. Sigh. If I had a dollar for every time somebody cried "Strawman!" when what they really should say is "Yes, that's a good argument, I'm afr

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote: > To prevent people from putting three paragraphs of lipsum in and > calling it a username. Limiting by UTF-8 bytes or UTF-16 units works just as well for that. > So you truncate to the desired length, then if the first character of > the trimm

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 1:03 AM, wrote: > On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: >> Number of code points is the most logical way to length-limit >> something. If you want to allow users to set their display names but >> not to make arbitrarily long ones, limiting them to X code poin

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote: > Number of code points is the most logical way to length-limit > something. If you want to allow users to set their display names but > not to make arbitrarily long ones, limiting them to X code points is > the safest way (and preferably do an N

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Sat, Mar 7, 2015 at 12:33 AM, wrote: > However, when do you _really_ want the number of characters? You may > want to use it for, for example, the number of columns in a 'monospace' > font, which you've already screwed up because you haven't accounted for > double-wide characters or combining

Re: Newbie question about text encoding

2015-03-06 Thread random832
On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote: > Also: > Can a programmer who is away from UTF-16 in one part of the system (say > by using python3) > assume he is safe all over? The most common failure of UTF-16 support, supposedly, is in programs misusing the number of code units (for length
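The code-unit versus code-point distinction can be sketched as follows, assuming Python 3.3 or later, where `len()` on a str counts code points. The sample string is my own choice.

```python
# One BMP character followed by one SMP character (U+1F4A9).
s = 'a\U0001F4A9'

# Python 3.3+ strings are indexed by code point.
code_points = len(s)

# UTF-16 needs a surrogate pair (two 16-bit units) for the SMP character.
utf16_units = len(s.encode('utf-16-le')) // 2

# UTF-8 uses one byte for 'a' and four bytes for U+1F4A9.
utf8_bytes = len(s.encode('utf-8'))
```

A program that reports the UTF-16 unit count as "the length" would be off by one for every character outside the BMP.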

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote: > On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote: > >> Broken systems can be shown up by anything. Suppose you have a program > >> that breaks when it gets a NUL character (not unknown in C code); is > >> the fault with the U

Re: Newbie question about text encoding

2015-03-06 Thread Chris Angelico
On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote: >> Broken systems can be shown up by anything. Suppose you have a program >> that breaks when it gets a NUL character (not unknown in C code); is >> the fault with the Unicode consortium for allocating something at >> codepoint 0, or the code that

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote: > Let's please stick to UTF-16, shall we? > > Now tell me: > - Is it broken or not? > - Is it widely used or not? > - Should programmers be careful of it or not? > - Should programmers be warned about it or not? Also: Can a program

Re: Newbie question about text encoding

2015-03-06 Thread Rustom Mody
On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote: > On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote: > > My conclusion: Early adopters of unicode -- Windows and Java -- were > > punished > > for their early adoption. You can blame the unicode consortium, you can > > blame

Re: Newbie question about text encoding

2015-03-05 Thread Chris Angelico
On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote: > My conclusion: Early adopters of unicode -- Windows and Java -- were punished > for their early adoption. You can blame the unicode consortium, you can > blame the babel of human languages, particularly that some use characters > and some only

Re: Newbie question about text encoding

2015-03-05 Thread Rustom Mody
On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote: > Rustom Mody wrote: > > > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: > >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: > >> > > >> > It lists some examples of software that somehow bre

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano
random...@fastmail.us wrote: > On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: >> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in >> UTF-8 >> and UTF-32, since that goes against the grain of the system. You would >> have >> to program in artificial restrictions that ot

Re: Newbie question about text encoding

2015-03-05 Thread random832
On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote: > I mostly agree with Chris. Supporting *just* the BMP is non-trivial in > UTF-8 > and UTF-32, since that goes against the grain of the system. You would > have > to program in artificial restrictions that otherwise don't exist. UTF-8 is alread

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano
Rustom Mody wrote: > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: >> > >> > It lists some examples of software that somehow break/goof going from >> > BMP-only unicode to 7.0 unicode. >> > >> > IOW the suggestion

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote: > On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: > > > > It lists some examples of software that somehow break/goof going from > > BMP-only > > unicode to 7.0 unicode. > > > > IOW the suggestion is that the two-way

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote: > > It lists some examples of software that somehow break/goof going from BMP-only > unicode to 7.0 unicode. > > IOW the suggestion is that the two-way classification > - ASCII > - Unicode > > is less useful and accurate than the 3-way > > - A

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 12:07:06 AM UTC+5:30, jmf wrote: > Le mardi 3 mars 2015 19:04:06 UTC+1, Rustom Mody a écrit : > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: > > > On 2/26/2015 8:24 AM, Chris Angelico wrote: > > > > On Thu, Feb 26, 2015 at 11:40 PM, Rus

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote: > On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: > > Rustom Mody wrote: > > > > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: > > >> On 2/26/2015 8:24 AM, Chris Angelic

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote: > Rustom Mody wrote: > > > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: > >> On 2/26/2015 8:24 AM, Chris Angelico wrote: > >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: > >> >> Wrot

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano wrote: > It is easy to mock what is not important to you. I daresay kids adding emoji > to their 10 character tweets would mock all the useless maths symbols in > Unicode too. Definitely! Who ever sings "do you wanna build an integral sign"? ChrisA

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote: > On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote: > > What I was trying to say expanded here > > http://blog.languager.org/2015/03/whimsical-unicode.html > > [Hope the word 'whimsical' is less jarring and more accurate t

Re: Newbie question about text encoding

2015-03-03 Thread Steven D'Aprano
Rustom Mody wrote: > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: >> On 2/26/2015 8:24 AM, Chris Angelico wrote: >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: >> >> Wrote something up on why we should stop using ASCII: >> >> http://blog.languager.org/2015/

Re: Newbie question about text encoding

2015-03-03 Thread Terry Reedy
On 3/3/2015 1:03 PM, Rustom Mody wrote: On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: You should add emoticons, but not call them or the above 'gibberish'. I think that this part of your post is more 'unprofessional' than the character blocks. It is very jarring a

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico
On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote: > What I was trying to say expanded here > http://blog.languager.org/2015/03/whimsical-unicode.html > [Hope the word 'whimsical' is less jarring and more accurate than > 'gibberish'] Re footnote #4: ½ is a single character for compatibility rea

Re: Newbie question about text encoding

2015-03-03 Thread Rustom Mody
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote: > On 2/26/2015 8:24 AM, Chris Angelico wrote: > > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote: > >> Wrote something up on why we should stop using ASCII: > >> http://blog.languager.org/2015/02/universal-unicode.html

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Sat, 28 Feb 2015 04:45:04 +1100, Chris Angelico wrote: > Perhaps, but on the other hand, the skill of squeezing code into less > memory is being replaced by other skills. We can write code that takes > the simple/dumb approach, let it use an entire megabyte of memory, and > not care about the co

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Fri, 27 Feb 2015 19:14:00 +, MRAB wrote: > I suppose you could load the basic parts first so that the user can > start working, and then load the additional features in the background. Quite possible. My opinion on this is very fluid; it may work for some applications, it probably would

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 7:52 AM, Dave Angel wrote: > If that's the case on the architectures you're talking about, then the > problem of slow loading is not triggered by the memory usage, but by lots of > initialization code. THAT's what should be deferred for seldom-used > portions of code. s/s

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel
On 02/27/2015 11:00 AM, alister wrote: On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: If you're trying to use the pagefile/swapfile as if it's more memory ("I have 256MB of memory, but 10GB of swap space, so that's 10GB of memory!"), then yes, these performance considerations are hu

Re: Newbie question about text encoding

2015-02-27 Thread MRAB
On 2015-02-27 16:45, alister wrote: On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: On Sat, Feb 28, 2015 at 3:00 AM, alister wrote: I think there is a case for bringing back the overlay file, or at least loading larger programs in sections only loading the routines as they are requi

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards
On 2015-02-27, Grant Edwards wrote: > On 2015-02-27, Steven D'Aprano wrote: > Dave Angel wrote: >>> On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote: > (Although I believe Seymour Cray was quoted as saying that virtual > memory is a crock, because "you can't fake what

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards
On 2015-02-27, Steven D'Aprano wrote: > Dave Angel wrote: > >> On 02/27/2015 12:58 AM, Steven D'Aprano wrote: >>> Dave Angel wrote: >>> (Although I believe Seymour Cray was quoted as saying that virtual memory is a crock, because "you can't fake what you ain't got.") >>> >>> If I recall

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 3:45 AM, alister wrote: > On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: > >> On Sat, Feb 28, 2015 at 3:00 AM, alister >> wrote: >>> I think there is a case for bringing back the overlay file, or at least >>> loading larger programs in sections only loading the

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote: > On Sat, Feb 28, 2015 at 3:00 AM, alister > wrote: >> I think there is a case for bringing back the overlay file, or at least >> loading larger programs in sections only loading the routines as they >> are required could speed up the star

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 3:00 AM, alister wrote: > I think there is a case for bringing back the overlay file, or at least > loading larger programs in sections > only loading the routines as they are required could speed up the start > time of many large applications. > examples libre office, I ra

Re: Newbie question about text encoding

2015-02-27 Thread alister
On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote: > > If you're trying to use the pagefile/swapfile as if it's more memory ("I > have 256MB of memory, but 10GB of swap space, so that's 10GB of > memory!"), then yes, these performance considerations are huge. But > suppose you need to run

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel
On 02/27/2015 09:22 AM, Chris Angelico wrote: On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel wrote: The term "virtual memory" is used for many aspects of the modern memory architecture. But I presume you're using it in the sense of "running in a swapfile" as opposed to running in physical RAM.

Re: Newbie question about text encoding

2015-02-27 Thread Chris Angelico
On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel wrote: > The term "virtual memory" is used for many aspects of the modern memory > architecture. But I presume you're using it in the sense of "running in a > swapfile" as opposed to running in physical RAM. Given that this started with a quote about "
