ANN: Advanced Python Training at PyCon PL
Advanced Python Training at PyCon PL

You have intermediate Python skills and would like to learn more about:

* Comprehensions
* Decorators
* Context managers
* Descriptors
* Metaclasses and
* Patterns?

Then you should attend this two-day training that provides a systematic coverage of these topics. Useful code samples and exercises provide hands-on learning. We offered this training at EuroPython 2012 and got very good feedback. Some of the participants understood much more of the complex topics than they anticipated.

Date: September 17th and 18th, 2012
Location: PyCon PL venue, Mąchocice, Poland
More information: http://pl.pycon.org/2012/en/training

This is an open course, but PyCon PL attendees will get a considerable discount.

Open courses 2012 and 2013 (till June)
--
17.09.-18.09.2012 (Mąchocice, Poland) Advanced Python at PyCon PL (English)
http://pl.pycon.org/2012/en/training
15.10.-17.10.2012 (Leipzig) Introduction to Django (English)
http://python-academy.com/courses/django_course_introduction.html
18.10.-20.10.2012 (Leipzig) Advanced Django (English)
http://python-academy.com/courses/django_course_advanced.html
27.10.2012 (Leipzig) SQLAlchemy (English)
http://python-academy.com/courses/specialtopics/python_course_sqlalchemy.html
28.10.2012 (Leipzig) Camelot (English)
http://python-academy.com/courses/specialtopics/python_course_camelot.html
12.-14.11.2012 (Antwerp, Belgium) Python for Programmers (English)
http://python-academy.com/courses/python_course_programmers.htm
15.11.2012 (Antwerp, Belgium) SQLAlchemy (English)
http://python-academy.com/courses/specialtopics/python_course_sqlalchemy.html
16.11.2012 (Antwerp, Belgium) Camelot (English)
http://python-academy.com/courses/specialtopics/python_course_camelot.html
10.12.-12.12.2012 (Leipzig) Python für Programmierer (German)
http://www.python-academy.de/Kurse/python_kurs_programmierer.html
13.12.-15.12.2012 (Leipzig) Python für Wissenschaftler und Ingenieure (German)
http://www.python-academy.de/Kurse/python_kurs_wissenschaftler.html
25.01.-27.01.2013 (Leipzig) Advanced Python (English)
http://python-academy.com/courses/specialtopics/python_course_advanced.html
28.01.-30.01.2013 (Leipzig) High-Performance Computation with Python (English)
http://python-academy.com/courses/python_course_high_performance.html
one day each (can be booked separately):
- Optimizing of Python Programs
  http://python-academy.com/courses/specialtopics/python_optimizing.html
- Python Extensions with Other Languages
  http://python-academy.com/courses/specialtopics/python_extensions.html
- Fast Code with the Cython Compiler
  http://python-academy.com/courses/specialtopics/python_course_cython.html
31.01.-01.02.2013 (Leipzig) High Performance XML with Python (English)
http://python-academy.com/courses/specialtopics/python_course_xml.html
04.03.-08.03.2013 (Chicago, USA) Python for Scientists and Engineers (English)
http://www.dabeaz.com/chicago/science.html
15.04.-17.04.2013 (Leipzig) Python für Programmierer (German)
http://www.python-academy.de/Kurse/python_kurs_programmierer.html
18.04.-20.04.2013 (Leipzig) Python für Wissenschaftler und Ingenieure (German)
http://www.python-academy.de/Kurse/python_kurs_wissenschaftler.html
10.06.-12.06.2013 (Leipzig) Python for Scientists and Engineers (English)
http://python-academy.com/courses/python_course_scientists.html
13.06.2013 (Leipzig) Fast Code with the Cython Compiler (English)
http://python-academy.com/courses/specialtopics/python_course_cython.html
14.06.2013 (Leipzig) Fast NumPy Processing with Cython (English)
http://python-academy.com/courses/specialtopics/python_course_numpy_cython.html

-- http://mail.python.org/mailman/listinfo/python-announce-list
Support the Python Software Foundation: http://www.python.org/psf/donations/
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico ros...@gmail.com writes: Generally, I'm working with pure ASCII, but port those same algorithms to Python and you'll easily be able to read in a file in some known encoding and manipulate it as Unicode. If it's pure ASCII, you can use the bytes or bytearray type. It's not so much 'random access to the nth character' as an efficient way of jumping forward. For instance, if I know that the next thing is a literal string of n characters (that I don't care about), I want to skip over that and keep parsing. I don't understand how this is supposed to work. You're going to read a large unicode text file (let's say it's UTF-8) into a single big string? So the runtime library has to scan the encoded contents to find the highest numbered codepoint (let's say it's mostly ascii but has a few characters outside the BMP), expand it all (in this case) to UCS-4 giving 4x memory bloat and requiring decoding all the UTF-8 regardless, and now we should worry about the efficiency of skipping n characters? Since you have to decode the n characters regardless, I'd think this skipping part should only be an issue if you have to do it a lot of times. -- http://mail.python.org/mailman/listinfo/python-list
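For illustration only (the data format and field names below are invented, not taken from either poster's code), here is the bytes-based "skip forward while parsing" idea in runnable form; the point is that skipping a known-length field is just an index bump, with no copying and no decoding:

# Minimal parsing sketch: track an integer offset into an ASCII bytes object
# instead of repeatedly slicing it.
data = b"NAME:alice;SKIP:0123456789;AGE:42;"

def parse(buf):
    pos = 0
    fields = []
    while pos < len(buf):
        colon = buf.index(b":", pos)          # end of the key
        key = buf[pos:colon]
        if key == b"SKIP":
            # we know the value is 10 bytes we don't care about:
            # just move the offset past value and terminator
            pos = colon + 1 + 10 + 1
            continue
        semi = buf.index(b";", colon + 1)     # end of the value
        fields.append((key, buf[colon + 1:semi]))
        pos = semi + 1
    return fields

print(parse(data))   # [(b'NAME', b'alice'), (b'AGE', b'42')]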
Re: Encapsulation, inheritance and polymorphism
On Tuesday, July 17, 2012 12:39:53 PM UTC-7, Mark Lawrence wrote: I would like to spend more time on this thread, but unfortunately the 44 ton artic carrying Java in a Nutshell Volume 1 Part 1 Chapter 1 Paragraph 1 Sentence 1 has just arrived outside my abode and needs unloading :-) That reminds me of a remark I made nearly 10 years ago: Well, I followed one friend's advice and investigated Java, perhaps a little too quickly. I purchased Ivor Horton's _Beginning_Java_2_ book. It is reasonably well-written. But how many pages did I have to read before I got through everything I needed to know, in order to read and write files? Four hundred! I need to keep straight detailed information about objects, inheritance, exceptions, buffers, and streams, just to read data from a text file??? I haven't actually sat down to program in Java yet. But at first glance, it would seem to be a step backwards even from the procedural C programming that I was doing a decade ago. I was willing to accept the complexity of the Windows GUI, and program with manuals open on my lap. It is a lot harder for me to accept that I will need to do this in order to process plain old text, perhaps without even any screen output. https://groups.google.com/d/topic/bionet.software/kk-EGGTHN1M/discussion Some things never change! :^) -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap".

On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters using two code points. This is fragile and doesn't work very well, because string-handling methods can break the surrogate pairs apart, leaving you with an invalid unicode string. Not good.) ... With PEP 393, each Python string will be stored in the most efficient format possible:

Can you explain the issue of breaking surrogate pairs apart a little more? Switching between encodings based on the string contents seems silly at first glance.

Forget encodings! We're not talking about encodings. Encodings are used for converting text to bytes for transmission over the wire or storage on disk. PEP 393 talks about the internal representation of text within Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a narrow build, text is stored using two bytes per character, so the string "len" (as in the name of the built-in function) will be stored as

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is LittleEndian or BigEndian), plus object overhead, which I shall ignore. Since most identifiers are ASCII, that's already using twice as much memory as needed. This standard data structure is called UCS-2, and it only handles characters in the Basic Multilingual Plane, the BMP (roughly the first 64000 Unicode code points). I'll come back to that.

In a wide build, text is stored as four bytes per character, so "len" is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much memory you have, you can always use more. This system is called UCS-4, and it can handle the entire Unicode character set, for now and forever. (If we ever need more than four bytes' worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters [technically: code points] in the Basic Multilingual Plane? There's an extension to UCS-2 called UTF-16 which extends it to the entire Unicode range. Yes, that's the same name as the UTF-16 encoding, because it's more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but characters outside the BMP by four bytes". There's a neat trick to this: the BMP doesn't use the entire two-byte range, so there are some byte pairs which are illegal in UCS-2 -- they don't correspond to *any* character. UTF-16 uses those byte pairs to signal "this is half a character, you need to look at the next pair for the rest of the character". Nifty hey? These pairs of pseudo-characters are called surrogate pairs.

Except this comes at a big cost: you can no longer tell how long a string is by counting the number of bytes, which is fast, because sometimes four bytes is two characters and sometimes it's one and you can't tell which it will be until you actually inspect all four bytes. Copying sub-strings now becomes either slow, or buggy.

Say you want to grab the fifth character in a string. The fast way using UCS-2 is to simply grab bytes 8 and 9 (remember characters are pairs of bytes and we start counting at zero) and you're done. Fast and safe if you're willing to give up the non-BMP characters.
It's also fast and safe if you use UCS-4, but then everything takes twice as much space, so you probably end up spending so much time copying null bytes that you're probably slower anyway. Especially when your OS starts paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8 and 9 are half of a surrogate pair, and you've now split the pair and ended up with an invalid string. That's what Python 3.2 does: it fails to handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'

I've just split a single valid Unicode character into two invalid characters. Python 3.2 will (probably) mindlessly process those two non-characters, and the only sign I have that I did something wrong is that my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair of bytes in order to index a string, or work out its length, or copy a substring. It's not enough to just check if the last pair is a surrogate. When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the internal representation of strings -- they are either fast or correct but cannot be both. But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 is sub-optimal because it pays four bytes for every character whether you need them or not.
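For readers who want the surrogate-pair mechanics spelled out, here is the standard UTF-16 arithmetic as a small self-contained sketch (ordinary Python, not CPython's internal code):

# UTF-16 surrogate-pair arithmetic for code points above U+FFFF.

def to_surrogates(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)        # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)       # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

hi, lo = to_surrogates(0xFFFF + 1)    # U+10000, the first non-BMP code point
print(hex(hi), hex(lo))               # 0xd800 0xdc00 -- matches the example above
print(hex(from_surrogates(hi, lo)))   # 0x10000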
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote: I'm aware of this (and all the blah blah blah you are explaining). This always the same song. Memory. Exactly. The reason it is always the same song is because it is an important song. No offense here. But this is an *american* answer. I am not American. I am not aware that computers outside of the USA, and Australia, have unlimited amounts of memory. You must be very lucky. The same story as the coding of text files, where utf-8 == ascii and the rest of the world doesn't count. UTF-8 is not ASCII. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

The change does not just benefit ASCII users. It primarily benefits anybody using a wide unicode build with strings mostly containing only BMP characters.

Just to be clear: If you have many strings which are *mostly* BMP, but have one or two non-BMP characters in *each* string, you will see no benefit. But if you have many strings which are all BMP, and only a few strings containing non-BMP characters, then you will see a big benefit.

Even for narrow build users, there is the benefit that with approximately the same amount of memory usage in most cases, they no longer have to worry about non-BMP characters sneaking in and breaking their code.

Yes! +1000 on that.

There is some additional benefit for Latin-1 users, but this has nothing to do with Python. If Python is going to have the option of a 1-byte representation (and as long as we have the flexible representation, I can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject structure"

and:

"The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient)."

then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number of 1-byte encodings, Latin-1 is hardly the only one.

because that's what 1-byte Unicode (UCS-1, if you will) is. If you have an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard 1-byte Unicode or UCS-1, as far as I can determine. The Unicode standard refers to some multiple million code points, far too many to fit in a single byte. There is some historical justification for using Unicode to mean UCS-2, but with the standard being extended beyond the BMP, that is no longer valid. See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.

I think what you are trying to say is that the Unicode designers deliberately matched the Latin-1 standard for Unicode's first 256 code points. That's not the same thing though: there is no Unicode standard mapping to a single byte format. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

As I understand (I think) the underlying mechanism, I can only say, it is not a surprise that it happens. Imagine an editor. I type an "a", internally the text is saved as ascii, then I type an "é", the text can only be saved in at least latin-1. Then I enter an "€", the text becomes an internal ucs-4 string. Then remove the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and so is nearly every character you're ever going to use unless you are Asian or a historian using some obscure ancient script. NONE of the examples you have shown in your emails have included 4-byte characters, they have all been ASCII or UCS-2. You are suffering from a misunderstanding about what is going on and misinterpreting what you have seen.

In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. That will not change. There is a tiny amount of fixed overhead for strings, and that overhead is slightly different between the versions, but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text that you type is not the same as how Python does it. A text editor is not going to be creating a new immutable string after every key press. That will be slow slow SLOW. The usual way is to keep a buffer for each paragraph, and add and subtract characters from the buffer.

Intuitively I expect there is some kind of slowdown between all these string conversions.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to UCS-4 on the fly, they are converted once, when the string is created. The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all we do is create new strings. First we create a string 'ab…', then we create another string 'ab…'*1000, then we create two new strings '…' and 'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just immediately create a new one and throw the old one away. You likely do work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop
steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of deciding whether they should be stored using 1, 2 or 4 bytes begins to fade into the noise.

When I tested this flexible representation, a few months ago, at the first alpha release, this is precisely what I tested: string manipulations which are forcing this internal change, and I concluded the result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable slow-down on Windows, report it as a bug.

Does anybody know a way to get the size of the internal string in bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038

As I said, there is a *tiny* overhead difference.
But identifiers will generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty string. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
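A rough way to see the per-character width and the fixed overhead being discussed is to compare getsizeof results for strings that differ only in length; the absolute numbers depend on the interpreter version, build width and platform, so treat this purely as an illustration:

# Estimate bytes-per-character and fixed overhead from getsizeof deltas.
# Absolute numbers vary between builds; only the differences matter.
import sys

def bytes_per_char(ch, n=1000):
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * 1)) // n

print(sys.getsizeof(''))              # fixed overhead of an empty string
for ch in 'a', 'é', '€', '\U00010000':
    print(repr(ch), bytes_per_char(ch))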
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote: a will be stored as 1 byte/codepoint. Adding é, it will still be stored as 1 byte/codepoint. Wrong. It will be 2 bytes, just like it already is in Python 3.2. I don't know where people are getting this myth that PEP 393 uses Latin-1 internally, it does not. Read the PEP, it explicitly states that 1-byte formats are only used for ASCII strings. Adding €, it will still be stored as 2 bytes/codepoint. That is correct. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote: The problem with strings containing surrogate pairs is that you could inadvertently slice the string in the middle of the surrogate pair. That's the *least* of the problems with surrogate pairs. That would be easy to fix: check the point of the slice, and back up or forward if you're on a surrogate pair. But that's not good enough, because the surrogates could be anywhere in the string. You have to touch every single character in order to know how many there are. The problem with surrogate pairs is that they make basic string operations O(N) instead of O(1). -- Steven -- http://mail.python.org/mailman/listinfo/python-list
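To make the slicing hazard concrete, here is a small, hedged sketch of the "check the slice boundaries" workaround mentioned above (my own illustration for narrow builds; it keeps slices valid, but does nothing about the deeper problem that indexes count UTF-16 code units rather than characters):

# Nudge slice boundaries off the middle of a surrogate pair (narrow builds).

def is_low_surrogate(ch):
    return '\udc00' <= ch <= '\udfff'

def safe_slice(s, start, end):
    # assumes 0 <= start <= end <= len(s); a sketch, not production code
    if start < len(s) and is_low_surrogate(s[start]):
        start -= 1                      # back up onto the high surrogate
    if end < len(s) and is_low_surrogate(s[end]):
        end += 1                        # include the low surrogate as well
    return s[start:end]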
Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)
On Sat, 18 Aug 2012 10:27:10 -0700, rusi wrote:

For example, my sister recently saw some of my mails and was mystified that I had sent back 'blank mails' until I explained and pointed out that my answers were interleaved into what was originally sent!

No offence to your sister, who I'm sure is probably a really great person and kind to small animals and furry children, but didn't she, you know, *investigate further* upon seeing something weird, namely a blank email? As in, "Gosh, dearest brother has sent me an email without saying anything. That's weird. I hope he's alright? Maybe there's something a bit further down? Or a funny picture of a cat at the end? Or something? I better scroll down a bit further and see."

I'm not talking about complicated tech stuff like View Message Source and trying to determine whether perhaps the MIME type is broken and there's an invisible attachment. I'm talking about almost the simplest thing in the friggin' world, *scrolling down and looking at what's there*. The software equivalent of somebody handing you a blank piece of paper and turning it around to see if maybe there's something on the back.

Because that's what I do, and I don't think I'm some sort of hyper-evolved mega-genius with a brain the size of a planet, I'm just some guy. Nobody needed to tell me "Hey dummy, the text you are looking for is a bit further down, keep reading." I just looked on my own, and saw the text on my own, and actually read it without being told to, and a little light bulb went on over my head and I went "Wow! People can actually write stuff in between other stuff! How did they do that?"

Now sure, I make allowances for 70 year olds who have never touched a computer before and have to ask "What's a scroll bar?" and "How do I use this mousey-pointer thing?" I assume your sister has minimal skills like can scroll and knows how to read.

I'm not sure which is worse -- that perhaps I *am* some sort of mega-genius and keep overestimating the difficulty of scroll-down-and-read for normal people, or that others have such short attention spans that anything that they can't see immediately in front of them might as well not exist. Either thought is rather depressing. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano wrote:

On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote: "a" will be stored as 1 byte/codepoint. Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2. I don't know where people are getting this myth that PEP 393 uses Latin-1 internally, it does not. Read the PEP, it explicitly states that 1-byte formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101) - sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101) - sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101) - sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin1 strings require one byte per character.
(2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system) over ASCII-only.

-- http://mail.python.org/mailman/listinfo/python-list
Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)
On Sun, Aug 19, 2012 at 5:15 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: The software equivalent of somebody handing you a blank piece of paper and turning it around to see if maybe there's something on the back. Straight out of a Goon Show, that is. Heh. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

Scanning 4 characters (or a few dozen, say) to peel off a token in parsing a UTF-8 string is no big deal. It gets more expensive if you want to index far more deeply into the string. I'm asking how often that is done in real code.

It happens all the time. Let's say you've got a bunch of text, and you use a regex to scan through it looking for a match. Let's ignore the regular expression engine, since it has to look at every character anyway. But you've done your search and found your matching text and now want everything *after* it. That's not exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]

Easy-peasy, right? But behind the scenes, you have a problem: how does Python know where text[end:] starts? With fixed-size characters, that's O(1): Python just moves forward end*width bytes into the string. Nice and fast. With variable-sized characters, Python has to start from the beginning again, and inspect each byte or pair of bytes. This turns the slice operation into O(N) and the combined op (search + slice) into O(N**2), and that starts getting *horrible*. As always, everything is fast for small enough N, but you *really* don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid character boundaries doesn't help you, because the string slice method cannot know where the indexes came from. I suppose you could have a fast slice and a slow slice method, but really, that sucks, and besides all that does is pass responsibility for tracking character boundaries to the developer instead of the language, and you know damn well that they will get it wrong and their code will silently do the wrong thing and they'll say that Python sucks and we never used to have this problem back in the good old days with ASCII. Boo sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For typical users, you end up wasting memory. That is the complaint driving PEP 393 -- memory is cheap, but it's not so cheap that you can afford to multiply your string memory by four just in case somebody someday gives you a character in one of the supplementary planes. If you have oodles of memory and small data sets, then UCS-4 is probably all you'll ever need. I hear that the club for people who have all the memory they'll ever need is holding their annual general meeting in a phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K different characters anyway?" Well, apart from Asians, and historians, and a bunch of other people. If you can control your data and make sure no non-BMP characters are used, UCS-2 is fine -- except Python doesn't actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up to the individual programmer to track character boundaries, and we know how well that works. Luckily the supplementary planes are only rarely used, and people who need them tend to buy more memory and use wide builds. People who only need a few non-BMP characters in a narrow build generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings, turn them into souped-up ropes-on-steroids. All those extra indexes mean that you don't save any memory. Because the objects are so much bigger and more complex, your CPU cache goes to the dogs and your code still runs slow.
Which leaves us right back where we started, PEP 393. Obviously one can concoct hypothetical examples that would suffer. If you think slicing at arbitrary indexes is a hypothetical example, I don't know what to say. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
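As an aside, the quadratic pattern described above typically shows up when code repeatedly slices off the front of the text; the usual mitigation is to pass a start offset to the search instead of building new strings. A sketch, with an invented pattern and text:

import re

pattern = re.compile(r'\d+')              # any pattern will do for illustration
text = 'x' * 10000 + '123' + 'y' * 10000

# Style 1: slice after every match -- each slice copies the whole tail.
def all_matches_slicing(pat, s):
    out = []
    while True:
        mo = pat.search(s)
        if not mo:
            return out
        out.append(mo.group())
        s = s[mo.end():]                  # copies the rest of the string every time

# Style 2: pass a start position instead -- no copies at all.
def all_matches_offset(pat, s):
    out = []
    pos = 0
    while True:
        mo = pat.search(s, pos)
        if not mo:
            return out
        out.append(mo.group())
        pos = mo.end()

print(all_matches_slicing(pattern, text) == all_matches_offset(pattern, text))  # True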
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent exposition of different Unicode encodings in response to my post. I can only hope some readers will benefit from it. I regret that I wasn't more clear about the perspective I posted from, i.e. that I'm already familiar with how those encodings work. After reading all of it, I still have the same skepticism on the main point as before, but I think I see what the issue in contention is, and some differences in perspective.

First of all, you wrote: "This standard data structure is called UCS-2 ... There's an extension to UCS-2 called UTF-16"

My own understanding is UCS-2 simply shouldn't be used any more. Unicode was historically supposed to be a 16-bit character set, but that turned out to not be enough, so the supplementary planes were added. UCS-2 thus became obsolete and UTF-16 superseded it in 1996. UTF-16 in turn is rather clumsy and the later UTF-8 is better in a lot of ways, but both of these are at least capable of encoding all the character codes.

On to the main issue:

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string operations are not O(1) but are O(N). That means they are slow, or buggy, pick one.

This I don't see. What are the basic string operations?

* Examine the first character, or first few characters ("few" = usually bounded by a small constant) such as to parse a token from an input stream. This is O(1) with either encoding.

* Slice off the first N characters. This is O(N) with either encoding if it involves copying the chars. I guess you could share references into the same string, but if the slice reference persists while the big reference is released, you end up not freeing the memory until later than you really should.

* Concatenate two strings. O(N) either way.

* Find length of string. O(1) either way since you'd store it in the string header when you build the string in the first place. Building the string has to have been an O(N) operation in either representation.

And finally:

* Access the nth char in the string for some large random n, or maybe get a small slice from some random place in a big string. This is where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that often.

Meanwhile, an example of the 393 approach failing: I was involved in a project that dealt with terabytes of OCR data of mostly English text. So the chars were mostly ascii, but there would be occasional non-ascii chars including supplementary plane characters, either because of special symbols that were really in the text, or the typical OCR confusion emitting those symbols due to printing imprecision. That's a natural for UTF-8 but the PEP-393 approach would bloat up the memory requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an error. s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it?
Ropes of UTF-8 segments seem like the most obvious approach and I wonder if it was considered. By that I mean pick some implementation constant k (say k=128) and represent the string as a UTF-8 encoded byte array, accompanied by a vector of n//k pointers into the byte array, where n is the number of codepoints in the string. Then you can reach any offset analogously to reading a random byte on a disk, by seeking to the appropriate block, and then reading the block and getting the char you want within it. Random access is then O(1) though the constant is higher than it would be with fixed width encoding. -- http://mail.python.org/mailman/listinfo/python-list
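Paul's index-into-UTF-8 idea is easy to sketch in pure Python (class name and constants invented for illustration; this says nothing about what CPython actually considered, and it ignores narrow-build complications):

# Sketch of a UTF-8 string plus a sparse index: the byte offset of every
# k-th code point.  Random access then costs O(k) decoding instead of O(n).

class IndexedUTF8:
    def __init__(self, text, k=128):
        self.k = k
        self.data = text.encode('utf-8')
        self.length = len(text)
        self.offsets = []                 # byte offsets of code points 0, k, 2k, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode('utf-8'))

    def __len__(self):
        return self.length

    def __getitem__(self, i):             # integer indexes only in this sketch
        if not 0 <= i < self.length:
            raise IndexError(i)
        block, rest = divmod(i, self.k)
        pos = self.offsets[block]
        # walk forward at most k-1 characters within the block
        while rest:
            first = self.data[pos]
            pos += 1 if first < 0x80 else 2 if first < 0xE0 else 3 if first < 0xF0 else 4
            rest -= 1
        first = self.data[pos]
        size = 1 if first < 0x80 else 2 if first < 0xE0 else 3 if first < 0xF0 else 4
        return self.data[pos:pos + size].decode('utf-8')

s = IndexedUTF8('abc€def…xyz', k=4)
print(s[3], s[7], len(s))   # € … 11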
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: result = text[end:]

If end is not near the end of the original string, then this is O(N) even with fixed-width representation, because of the char copying. If it is near the end, then by knowing where the string data area ends, I think it should be possible to scan backwards from the end, recognizing what bytes can be the beginning of code points and counting off the appropriate number. This is O(1) if "near the end" means within a constant.

You could say "Screw the full Unicode standard, who needs more than 64K ..."

No; if you're claiming the language supports Unicode, it should be the whole standard.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up to the individual programmer to track character boundaries, ...

I'm surprised the Python 3 implementers even considered that approach, much less went ahead with it. It's obviously wrong.

You could add a whole lot more heavyweight infrastructure to strings, turn them into souped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin no.email@nospam.invalid wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: result = text[end:] if end not near the end of the original string, then this is O(N) even with fixed-width representation, because of the char copying. if it is near the end, by knowing where the string data area ends, I think it should be possible to scan backwards from the end, recognizing what bytes can be the beginning of code points and counting off the appropriate number. This is O(1) if near the end means within a constant. Only if you know exactly where the end is (which requires storing and maintaining a character length - this may already be happening, I don't know). But that approach means you need to have code for both ways (forward search or reverse), and of course it relies on your encoding being reverse-scannable in this way (as UTF-8 is, but not all). And of course, taking the *entire* rest of the string isn't the only thing you do. What if you want to take the next six characters after that index? That would be constant time with a fixed-width storage format. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
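The backward scan really is straightforward for UTF-8 in particular, because continuation bytes always have the form 0b10xxxxxx. A minimal sketch (function name invented) of taking the last n characters of UTF-8 bytes without decoding the whole string:

# Take the last n characters of UTF-8 encoded bytes by scanning backwards.
# Works because UTF-8 continuation bytes are recognizable; as Chris notes,
# not every encoding can be reverse-scanned this way.

def last_n_chars(data, n):
    pos = len(data)
    seen = 0
    while seen < n and pos > 0:
        pos -= 1
        if (data[pos] & 0xC0) != 0x80:   # not a continuation byte: a character starts here
            seen += 1
    return data[pos:].decode('utf-8')

text = 'abcdéfg€hij'
print(last_n_chars(text.encode('utf-8'), 5))   # g€hij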
Re: How do I display unicode value stored in a string variable using ord()
Chris Angelico ros...@gmail.com writes: And of course, taking the *entire* rest of the string isn't the only thing you do. What if you want to take the next six characters after that index? That would be constant time with a fixed-width storage format.

How often is this an issue in practice? I wonder how other languages deal with this. The examples I can think of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons. Also has some modified UTF-8 for reasons that made no sense and that I don't remember.

3. Haskell - basic string type is a linked list of code points. "hello" is five list nodes. New Data.Text library (much more efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell. Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ... (any other important ones?)

-- http://mail.python.org/mailman/listinfo/python-list
Re: Encapsulation, inheritance and polymorphism
On 19/08/2012 06:21, Robert Miles wrote: On 7/23/2012 11:18 AM, Albert van der Horst wrote: In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: SNIP.

Even with a break, why bother continuing through the body of the function when you already have the result? When your calculation is done, it's done, just return for goodness sake. You wouldn't write a search that keeps going after you've found the value that you want, out of some misplaced sense that you have to look at every value. Why write code with unnecessary guard values and temporary variables out of a misplaced sense that functions must only have one exit?

Example from recipes: "Stir until the egg white is stiff." Alternative: "Stir egg white for half an hour, but if the egg white is stiff keep your spoon still." (Cooking is not my field of expertise, so the wording may not be quite appropriate.)

-- Steven

Groetjes Albert

Note that you forgot applying enough heat to do the cooking.

Surely the first check is your filing system to make sure that you've paid the utilities bills so you've got gas and or electricity to apply the heat. Either that or you hire Ray Mears to produce the spark needed to light the fire :) -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")

And it is good enough to show the problem. Period. The rest (you have to do this, you should not do this, why are you using these characters - amazing and stupid question -) does not count.

The real problem is elsewhere. *Americans* do not wish a character to occupy 4 bytes in *their* memory. The rest of the world does not count.

The same thing happens with the utf-8 coding scheme. Technically, it is fine. But after n years of usage, one should recognize it just became an ascii2. Especially for those who understand nothing in that field and are not even aware that characters are coded. I'm the first to think this is legitimate. Memory or ability to treat all text in the same and equal way?

End note. This kind of discussion is not specific to Python, it always happens when there is some kind of conflict between ascii and non-ascii users.

Have a nice day.

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

Steven D'Aprano wrote: I don't know where people are getting this myth that PEP 393 uses Latin-1 internally, it does not. Read the PEP, it explicitly states that 1-byte formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]
py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]

On re-reading the PEP more closely, it looks like I did misunderstand the internal implementation, and strings which fit exactly in Latin-1 will also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-byte data. So I stand corrected. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Branch and Bound Algorithm / Module for Python?
Hello everybody, I would like to solve a Mixed Integer Optimization Problem with the Branch-And-Bound Algorithm. I designed my Minimizing function and the constraints. I tested them in a small program in AIMMS. So I already know that they are solvable. Now I want to solve them using Python. Is there a module / methods that I can download or a ready-made program text that you know about, where I can put my constraints and minimization function in? Rebekka -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()
Le dimanche 19 août 2012 10:56:36 UTC+2, Steven D'Aprano a écrit : internal implementation, and strings which fit exactly in Latin-1 will

And this is the crucial point. latin-1 is an obsolete and unusable coding scheme (esp. for European languages). We fall on the point I mentioned above. Microsoft knows this, ditto for Apple, ditto for TeX, ditto for the foundries.

Even ISO has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Steven D'Aprano wrote:

On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote: Steven D'Aprano wrote: I don't know where people are getting this myth that PEP 393 uses Latin-1 internally, it does not. Read the PEP, it explicitly states that 1-byte formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

Yes, I am using a 64-bit build. I thought that

(2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system) over ASCII-only.

would convey that. The corresponding data structure

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

makes for 12 extra bytes on 32 bit, and both Py_ssize_t and pointers double in size (from 4 to 8 bytes) on 64 bit. I'm sure you can do the maths for the embedded PyASCIIObject yourself. -- http://mail.python.org/mailman/listinfo/python-list
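Spelling out that arithmetic (assuming the usual sizes of 4 and 8 bytes for Py_ssize_t and pointers on 32-bit and 64-bit builds respectively):

# Extra fields in PyCompactUnicodeObject beyond the embedded PyASCIIObject:
# utf8_length (Py_ssize_t) + utf8 (char *) + wstr_length (Py_ssize_t).
for name, word in ('32-bit', 4), ('64-bit', 8):
    print(name, 3 * word, 'extra bytes')   # 12 on 32-bit, 24 on 64-bit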
Re: Branch and Bound Algorithm / Module for Python?
On Sun, 19 Aug 2012 02:04:20 -0700, Rebekka-Marie wrote: I would like to solve a Mixed Integer Optimization Problem with the Branch-And-Bound Algorithm. [...] Is there a module / methods that I can download or a ready-made program text that you know about, where I can put my constraints and minimization function in? Sounds like it might be something from Numpy or Scipy? http://numpy.scipy.org/ http://www.scipy.org/ This might be useful too: http://telliott99.blogspot.com.au/2010/03/branch-and-bound.html Good luck! If you do find something, come back and tell us please. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 07:09, Steven D'Aprano wrote: This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap".

Thank you for this excellent post, it has certainly cleared up a few things for me

[snip]

incidentally

But in UTF-16, ...

[snip]

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'

in IDLE

Python 3.2.3 (default, May 3 2012, 15:51:42) [GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
No Subprocess
>>> s = chr(0xFFFF + 1)
>>> a, b = s
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    a, b = s
ValueError: need more than 1 value to unpack

At a terminal prompt

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10) [GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = chr(0xFFFF + 1)
>>> a, b = s
>>> a
'\ud800'
>>> b
'\udc00'

The date stamp is different but the Python version is the same.

No idea why this is happening, I just thought it was interesting

lipska

-- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat lipskathe...@yahoo.co.uk wrote: The date stamp is different but the Python version is the same Check out what 'sys.maxunicode' is in each of those Pythons. It's possible that one is a wide build and the other narrow. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Le dimanche 19 août 2012 11:37:09 UTC+2, Peter Otten a écrit :

You know, the technical aspect is one thing. Understanding the coding of the characters as a whole is something else. The important point is not the coding per se, the relevant point is the set of characters a coding may represent.

You can build the most sophisticated mechanism you wish; if it does not take that point into account, it will always fail or be not optimal.

This is precisely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable.

Fascinating, isn't it? Devs are developing sophisticated tools based on a non-working basis.

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: Branch and Bound Algorithm / Module for Python?
On 19/08/2012 11:04, Steven D'Aprano wrote: On Sun, 19 Aug 2012 02:04:20 -0700, Rebekka-Marie wrote: I would like to solve a Mixed Integer Optimization Problem with the Branch-And-Bound Algorithm. [...] Is there a module / methods that I can download or a ready-made program text that you know about, where I can put my constraints and minimization function in? Sounds like it might be something from Numpy or Scipy? http://numpy.scipy.org/ http://www.scipy.org/ This might be useful too: http://telliott99.blogspot.com.au/2010/03/branch-and-bound.html Good luck! If you do find something, come back and tell us please. In addition to the above there's always the Python Package Index at http://pypi.python.org/pypi -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/2012 09:54, wxjmfa...@gmail.com wrote: About the exemples contested by Steven: eg: timeit.timeit(('ab…' * 10).replace('…', 'œ…')) And it is good enough to show the problem. Period. The rest (you have to do this, you should not do this, why are you using these characters - amazing and stupid question -) does not count. The real problem is elsewhere. *Americans* do not wish a character occupies 4 bytes in *their* memory. The rest of the world does not count. The same thing happens with the utf-8 coding scheme. Technically, it is fine. But after n years of usage, one should recognize it just became an ascii2. Especially for those who undestand nothing in that field and are not even aware, characters are coded. I'm the first to think, this is legitimate. Memory or ability to treat all text in the same and equal way? End note. This kind of discussion is not specific to Python, it always happen when there is some kind of conflict between ascii and non ascii users. Have a nice day. jmf Roughly translated. I've been shot to pieces and having seen Monty Python and the Holy Grail I know what to do. Run away, run away -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 11:19, Chris Angelico wrote: On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat lipskathe...@yahoo.co.uk wrote: The date stamp is different but the Python version is the same

Check out what 'sys.maxunicode' is in each of those Pythons. It's possible that one is a wide build and the other narrow.

Ah ... I built my local version from source and no, I didn't read the makefile so I didn't configure for a wide build :-( not that I would have known the difference at that time.

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10) [GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535

Later, I did an apt-get install idle3 which pulled down a precompiled IDLE from the Ubuntu repos. This was obviously compiled 'wide'.

Python 3.2.3 (default, May 3 2012, 15:51:42) [GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
No Subprocess
>>> import sys
>>> sys.maxunicode
1114111

All very interesting and enlightening

Thanks

lipska

-- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: result = text[end:]

if end not near the end of the original string, then this is O(N) even with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which means it's fast: your OS and hardware try to make straight memory copies as fast as possible. Big-Oh analysis frequently glosses over implementation details like that.

Of course, that assumption gets shaky when you start talking about extra large blocks, and it falls apart completely when your OS starts paging memory to disk. But if it helps to avoid irrelevant technical details, change it to text[end:end+10] or something.

if it is near the end, by knowing where the string data area ends, I think it should be possible to scan backwards from the end, recognizing what bytes can be the beginning of code points and counting off the appropriate number. This is O(1) if near the end means within a constant.

You know, I think you are misusing Big-Oh analysis here. It really wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort lists with a single item". Well, yes, that is absolutely true, but that's a special case that doesn't give you any insight into why using Bubble Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory representations is a terrible idea because you can't assume that people will only ever want to index the first or last character. On average, you need to scan half the string, one character at a time. In Big-Oh, we can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings. Haskell is an exception, using linked lists which require traversing the string to jump to an index. The manual even warns:

[quote]
If you think of a Text value as an array of Char values (which it is not), you run the risk of writing inefficient code. An idiom that is common in some languages is to find the numeric offset of a character or substring, then use that number to split or trim the searched string. With a Text value, this approach would require two O(n) operations: one to perform the search, and one to operate from wherever the search ended.
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html

-- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Encapsulation, inheritance and polymorphism
On 19/08/12 09:55, Mark Lawrence wrote: On 19/08/2012 06:21, Robert Miles wrote: On 7/23/2012 11:18 AM, Albert van der Horst wrote: In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [snip] that functions must only have one exit? [snip[ Surely the first check is your filing system to make sure that you've paid the utilties bills so you've got gas and or electricity to apply the heat. Either that or you hire Ray Mears to produce the spark needed to light the fire :) I was wondering how long it would be ... lipska -- Lipska the Kat©: Troll hunter, sandbox destroyer and farscape dreamer of Aeryn Sun -- http://mail.python.org/mailman/listinfo/python-list
Re: Encapsulation, inheritance and polymorphism
On 19/08/2012 12:50, lipska the kat wrote: On 19/08/12 09:55, Mark Lawrence wrote: On 19/08/2012 06:21, Robert Miles wrote: On 7/23/2012 11:18 AM, Albert van der Horst wrote: In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [snip] that functions must only have one exit? [snip[ Surely the first check is your filing system to make sure that you've paid the utilties bills so you've got gas and or electricity to apply the heat. Either that or you hire Ray Mears to produce the spark needed to light the fire :) I was wondering how long it would be ... lipska Six days shalt thou labour... :) -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit : On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precisely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable.

No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental.

And this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in latin-1. The problem lies in those European characters which cannot fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect.

Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect.

You should think "character set" and not "encoded code point", insofar as this kind of expression has a meaning in an 8-bit coding scheme.

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote: Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit : On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental. And this this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in a latin-1. The problem lies in these european characters, which can not fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect. Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect. You should think character set and not encoded code point, considering this kind of expression has a sense in 8-bits coding scheme. jmf But that choice was made decades ago when Unicode picked its second 128 characters. The internal form used in this PEP is simply the low-order byte of the Unicode code point. Trying to scan the string deciding if converting to cp1252 (for example) would be a much more expensive operation than seeing how many bytes it'd take for the largest code point. -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
(pardon the resend, but I accidentally omitted a couple of words)

On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote: Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit : SNIP

No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental.

And this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in latin-1. The problem lies in those European characters which cannot fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect. Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect. You should think "character set" and not "encoded code point", insofar as this kind of expression has a meaning in an 8-bit coding scheme.

jmf

But that choice was made decades ago when Unicode picked its second 128 characters. The internal form used in this PEP is simply the low-order byte of the Unicode code point. Trying to scan the string deciding if converting to cp1252 (for example) would work, would be a much more expensive operation than seeing how many bytes it'd take for the largest code point.

The 8-bit form is used if all the code points are less than 256. That is a simple description, and simple code. As several people have said, the fact that this byte matches one of the DECODED forms is coincidence.

-- DaveA -- http://mail.python.org/mailman/listinfo/python-list
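Dave's rule can be modelled in a few lines of ordinary Python (a toy model of the PEP's width decision, not the actual C implementation):

# Toy model of the PEP 393 width decision: one pass over the code points,
# then pick 1, 2 or 4 bytes per character based on the maximum.

def storage_width(text):
    if not text:
        return 1
    max_cp = max(ord(ch) for ch in text)
    if max_cp < 0x100:
        return 1        # stored as the low byte of each code point
    elif max_cp < 0x10000:
        return 2
    else:
        return 4

for s in 'hello', 'héllo', 'h€llo', 'h\U00010000llo':
    print(repr(s), storage_width(s))
# 'hello' 1, 'héllo' 1, 'h€llo' 2, the non-BMP example 4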
Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)
Hi Steve, I don't think I'm some sort of hyper-evolved mega-genius with a brain the size of a planet, I'm just some guy. Based on reading thousands of your posts over the past 4 years, I'll have to respectfully disagree with you on your assertion that you are not some hyper-evolved genius with a brain the size of a planet. :) I've learned a ton from reading your posts - so much so that I think my brain is getting heavier[1]. Thank you and cheers! Malcolm From a recent thread on this mailing list (hilarious) http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Le dimanche 19 août 2012 14:29:17 UTC+2, Dave Angel a écrit : On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote: Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit : On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental. And this this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in a latin-1. The problem lies in these european characters, which can not fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect. Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect. You should think character set and not encoded code point, considering this kind of expression has a sense in 8-bits coding scheme. jmf But that choice was made decades ago when Unicode picked its second 128 characters. The internal form used in this PEP is simply the low-order byte of the Unicode code point. Trying to scan the string deciding if converting to cp1252 (for example) would be a much more expensive operation than seeing how many bytes it'd take for the largest code point. You are absolutely right. (I'm quite comfortable with Unicode.) If Python wishes to perpetuate this, let's call it a design mistake or annoyance, it will continue to live with problems. People (tools) who chose pure utf-16 or utf-32 are not suffering from this issue. *My* final comment on this thread. In August 2012, after 20 years of development, Python is not able to display a piece of text correctly on a Windows console (eg cp65001). I downloaded the Go language, with zero experience, and did not succeed in getting it to display a piece of text incorrectly. (This is by the way *the* reason why I tested it.) Where the problems are coming from, I have no idea. I find this situation quite comic. Python is able to produce this: (1.1).hex() '0x1.199999999999ap+0' but it is not able to display a piece of text! Try to convince end users that IEEE 754 is more important than the ability to read and write a piece of text that a 6-year-old kid has learned at school :-) (I'm not suffering from this kind of effect; as a Windows user, I'm always working via a gui. Still, the problem exists.) Regards, jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote: Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: This standard data structure is called UCS-2 ... There's an extension to UCS-2 called UTF-16 My own understanding is UCS-2 simply shouldn't be used any more. Pretty much. But UTF-16 with lax support for surrogates (that is, surrogates are included but treated as two characters) is essentially UCS-2 with the restriction against surrogates lifted. That's what Python currently does, and Javascript. http://mathiasbynens.be/notes/javascript-encoding The reality is that support for the Unicode supplementary planes is pretty poor. Even when applications support it, most fonts don't have glyphs for the characters. Anything which makes handling of Unicode supplementary characters better is a step forward. * Variable-byte formats like UTF-8 and UTF-16 mean that basic string operations are not O(1) but are O(N). That means they are slow, or buggy, pick one. This I don't see. What are the basic string operations? The ones I'm specifically referring to are indexing and copying substrings. There may be others. * Examine the first character, or first few characters (few = usually bounded by a small constant) such as to parse a token from an input stream. This is O(1) with either encoding. That's actually O(K), for K = a few, whatever a few means. But we know that anything is fast for small enough N (or K in this case). * Slice off the first N characters. This is O(N) with either encoding if it involves copying the chars. I guess you could share references into the same string, but if the slice reference persists while the big reference is released, you end up not freeing the memory until later than you really should. As a first approximation, memory copying is assumed to be free, or at least constant time. That's not strictly true, but Big Oh analysis is looking at algorithmic complexity. It's not a substitute for actual benchmarks. Meanwhile, an example of the 393 approach failing: I was involved in a project that dealt with terabytes of OCR data of mostly English text. I assume that this wasn't one giant multi-terrabyte string. So the chars were mostly ascii, but there would be occasional non-ascii chars including supplementary plane characters, either because of special symbols that were really in the text, or the typical OCR confusion emitting those symbols due to printing imprecision. That's a natural for UTF-8 but the PEP-393 approach would bloat up the memory requirements by a factor of 4. Not necessarily. Presumably you're scanning each page into a single string. Then only the pages containing a supplementary plane char will be bloated, which is likely to be rare. Especially since I don't expect your OCR application would recognise many non-BMP characters -- what does U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software doesn't recognise it, you can't get it in your output. (If you do, the OCR software has a nasty bug.) Anyway, in my ignorant opinion the proper fix here is to tell the OCR software not to bother trying to recognise Imperial Aramaic, Domino Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't expecting them in your source material. Not only will the scanning go faster, but you'll get fewer wrong characters. [...] 
I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it? There has to be a first time for everything. Ropes of UTF-8 segments seems like the most obvious approach and I wonder if it was considered. Ropes have been considered and rejected because while they are asymptotically fast, in common cases the added complexity actually makes them slower. Especially for immutable strings where you aren't inserting into the middle of a string. http://mail.python.org/pipermail/python-dev/2000-February/002321.html PyPy has revisited ropes and uses, or at least used, ropes as their native string data structure. But that's ropes of *bytes*, not UTF-8. http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html -- Steven -- http://mail.python.org/mailman/listinfo/python-list
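For concreteness, the lax-surrogate UTF-16 behaviour described above is easy to observe by comparing a 3.2 narrow build with 3.3 (or a wide build) on a single non-BMP character:

    # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so a narrow build
    # stores it as a surrogate pair and reports it as two "characters".
    s = '\U0001D11E'

    print(len(s))                      # 2 on a 3.2 narrow build, 1 on 3.3 / wide builds
    print([hex(ord(c)) for c in s])
    # narrow build: ['0xd834', '0xdd1e']   (the surrogate pair)
    # 3.3 / wide:   ['0x1d11e']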
Re: New internal string format in 3.3
On Sun, 19 Aug 2012 03:19:23 -0700, wxjmfauth wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. That's very funny. Are you aware that your post is entirely Latin-1? Fascinating, isn't it? Devs are developing sophisticed tools based on a non working basis. At the end of the day, PEP 393 fixes some major design limitations of the Unicode implementation in the narrow build Python, while saving memory for people using the wide build. Everybody wins here. Your objection appears to be based on some sort of philosophical objection to Latin-1 than on any genuine problem. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
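Whether a given piece of text fits in Latin-1 (and hence in the one-byte form) can be checked mechanically; a small helper, purely for illustration:

    def fits_in_latin1(text):
        """Return True if every character has a code point below 256."""
        try:
            text.encode('latin-1')
            return True
        except UnicodeEncodeError:
            return False

    print(fits_in_latin1('cp1252, mac-roman or iso-8859-15'))  # True
    print(fits_in_latin1('\u20ac'))                            # False: U+20AC is above 255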
Re: New internal string format in 3.3
On 19/08/2012 13:59, wxjmfa...@gmail.com wrote: Le dimanche 19 août 2012 14:29:17 UTC+2, Dave Angel a écrit : On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote: Le dimanche 19 ao�t 2012 12:26:44 UTC+2, Chris Angelico a �crit : On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental. And this this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in a latin-1. The problem lies in these european characters, which can not fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect. Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect. You should think character set and not encoded code point, considering this kind of expression has a sense in 8-bits coding scheme. jmf But that choice was made decades ago when Unicode picked its second 128 characters. The internal form used in this PEP is simply the low-order byte of the Unicode code point. Trying to scan the string deciding if converting to cp1252 (for example) would be a much more expensive operation than seeing how many bytes it'd take for the largest code point. You are absoletely right. (I'm quite comfortable with Unicode). If Python wish to perpetuate this, lets call it, design mistake or ennoyement, it will continue to live with problems. Please give a precise description of the design mistake and what you would do to correct it. People (tools) who chose pure utf-16 or utf-32 are not suffering from this issue. *My* final comment on this thread. In August 2012, after 20 years of development, Python is not able to display a piece of text correctly on a Windows console (eg cp65001). Examples please. I downloaded the go language, zero experience, I did not succeed to display incorrecly a piece of text. (This is by the way *the* reason why I tested it). Where the problems are coming from, I have no idea. I find this situation quite comic. Python is able to produce this: (1.1).hex() '0x1.1999ap+0' but it is not able to display a piece of text! So you keep saying, but when asked for examples or evidence nothing gets produced. Try to convince end users IEEE 754 is more important than the ability to read/wirite a piece a text, a 6-years kid has learned at school :-) (I'm not suffering from this kind of effect, as a Windows user, I'm always working via gui, it still remains, the problem exists. Windows is a law unto itself. Its problems are hardly specific to Python. Regards, jmf Now two or three times you've said you're going but have come back. If you come again could you please provide examples and or evidence of what you're on about, because you still have me baffled. -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Le dimanche 19 août 2012 15:46:34 UTC+2, Mark Lawrence a écrit : On 19/08/2012 13:59, wxjmfa...@gmail.com wrote: Le dimanche 19 ao�t 2012 14:29:17 UTC+2, Dave Angel a �crit : On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote: Le dimanche 19 ao�t 2012 12:26:44 UTC+2, Chris Angelico a �crit : On Sun, Aug 19, 2012 at 8:19 PM, wxjmfa...@gmail.com wrote: This is precicely the weak point of this flexible representation. It uses latin-1 and latin-1 is for most users simply unusable. No, it uses Unicode, and as an optimization, attempts to store the codepoints in less than four bytes for most strings. The fact that a one-byte storage format happens to look like latin-1 is rather coincidental. And this this is the common basic mistake. You do not push your argumentation far enough. A character may fall accidentally in a latin-1. The problem lies in these european characters, which can not fall in this coding. This *is* the cause of the negative side effects. If you are using a correct coding scheme, like cp1252, mac-roman or iso-8859-15, you will never see such a negative side effect. Again, the problem is not the result, the encoded character. The critical part is the character which may cause this side effect. You should think character set and not encoded code point, considering this kind of expression has a sense in 8-bits coding scheme. jmf But that choice was made decades ago when Unicode picked its second 128 characters. The internal form used in this PEP is simply the low-order byte of the Unicode code point. Trying to scan the string deciding if converting to cp1252 (for example) would be a much more expensive operation than seeing how many bytes it'd take for the largest code point. You are absoletely right. (I'm quite comfortable with Unicode). If Python wish to perpetuate this, lets call it, design mistake or ennoyement, it will continue to live with problems. Please give a precise description of the design mistake and what you would do to correct it. People (tools) who chose pure utf-16 or utf-32 are not suffering from this issue. *My* final comment on this thread. In August 2012, after 20 years of development, Python is not able to display a piece of text correctly on a Windows console (eg cp65001). Examples please. I downloaded the go language, zero experience, I did not succeed to display incorrecly a piece of text. (This is by the way *the* reason why I tested it). Where the problems are coming from, I have no idea. I find this situation quite comic. Python is able to produce this: (1.1).hex() '0x1.1999ap+0' but it is not able to display a piece of text! So you keep saying, but when asked for examples or evidence nothing gets produced. Try to convince end users IEEE 754 is more important than the ability to read/wirite a piece a text, a 6-years kid has learned at school :-) (I'm not suffering from this kind of effect, as a Windows user, I'm always working via gui, it still remains, the problem exists. Windows is a law unto itself. Its problems are hardly specific to Python. Regards, jmf Now two or three times you've said you're going but have come back. If you come again could you please provide examples and or evidence of what you're on about, because you still have me baffled. -- Cheers. Mark Lawrence. Yesterday, I went to bed. More seriously. I can not give you more numbers than those I gave. As a end user, I noticed and experimented my random tests are always slower in Py3.3 than in Py3.2 on my Windows platform. 
It is up to you, the core developers, to give an explanation about this behaviour. As I understand the coding of characters a little, I pointed out that this is most probably due to the flexible string representation (with arguments appearing randomly in the misc. messages, mainly latin-1). I cannot do more. (I mistakenly spoke about factors of 0.1 to ...; you should of course read 1.1 to ...) jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 19 August 2012 15:09, wxjmfa...@gmail.com wrote: I can not give you more numbers than those I gave. As a end user, I noticed and experimented my random tests are always slower in Py3.3 than in Py3.2 on my Windows platform. Do the problems have a significant impact on any real application (rather than random tests)? Any significant change in implementation such as this is likely to have both positive and negative performance costs. The important thing is how it affects a real application as a whole. It is up to you, the core developers to give an explanation about this behaviour. Unless others are unable to reproduce your observations. If there is a big performance hit for text heavy applications then it's worth reporting but you should focus your energy on distilling a *meaningful* test case (rather than ranting about Americans, unicode, latin-1 and so on). Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 19/08/2012 15:09, wxjmfa...@gmail.com wrote: I can not give you more numbers than those I gave. As a end user, I noticed and experimented my random tests are always slower in Py3.3 than in Py3.2 on my Windows platform. Once again you refuse to supply anything to back up what you say. It is up to you, the core developers to give an explanation about this behaviour. Core developers cannot give an explanation for something that doesn't exist, except in your imagination. Unless you can produce the evidence that supports your claims, including details of OS, benchmarks used and so on and so forth. As I understand a little bit the coding of the characters, I pointed out, this is most probably due to this flexible string representation (with arguments appearing randomly in the misc. messages, mainly latin-1). I can not do more. (I stupidly spoke about factors 0.1 to ..., you should read of course, 1.1, to ...) jmf I suspect that I'll be dead and buried long before you can produce anything concrete in the way of evidence. I've thrown down the gauntlet several times, do you now have the courage to pick it up, or are you going to resort to the FUD approach that you've been using throughout this thread? -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/12 15:25, Steven D'Aprano wrote: Not necessarily. Presumably you're scanning each page into a single string. Then only the pages containing a supplementary plane char will be bloated, which is likely to be rare. Especially since I don't expect your OCR application would recognise many non-BMP characters -- what does U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software doesn't recognise it, you can't get it in your output. (If you do, the OCR software has a nasty bug.) Anyway, in my ignorant opinion the proper fix here is to tell the OCR software not to bother trying to recognise Imperial Aramaic, Domino Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't expecting them in your source material. Not only will the scanning go faster, but you'll get fewer wrong characters. Consider the automated recognition of a CAPTCHA. As the chars have to be entered by the user on a keyboard, only the most basic charset can be used, so the problem of which chars are possible is quite limited. -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Le dimanche 19 août 2012 16:48:48 UTC+2, Mark Lawrence a écrit : On 19/08/2012 15:09, wxjmfa...@gmail.com wrote: I can not give you more numbers than those I gave. As a end user, I noticed and experimented my random tests are always slower in Py3.3 than in Py3.2 on my Windows platform. Once again you refuse to supply anything to back up what you say. It is up to you, the core developers to give an explanation about this behaviour. Core developers cannot give an explanation for something that doesn't exist, except in your imagination. Unless you can produce the evidence that supports your claims, including details of OS, benchmarks used and so on and so forth. As I understand a little bit the coding of the characters, I pointed out, this is most probably due to this flexible string representation (with arguments appearing randomly in the misc. messages, mainly latin-1). I can not do more. (I stupidly spoke about factors 0.1 to ..., you should read of course, 1.1, to ...) jmf I suspect that I'll be dead and buried long before you can produce anything concrete in the way of evidence. I've thrown down the gauntlet several times, do you now have the courage to pick it up, or are you going to resort to the FUD approach that you've been using throughout this thread? -- Cheers. Mark Lawrence. I do not remember the tests I'have done at the 1st alpha release time. It was with an interactive interpreter. I precisely pay attention to test these chars you can find in the range 128..256 in all 8-bits coding schemes. Chars I suspected to be problematic. Here a short test again, a random single test, the first idea coming in my mind. Py 3.2.3 timeit.timeit(('aœ€'*100).replace('a', 'œ€é')) 4.99396356635981 Py 3.3b2 timeit.timeit(('aœ€'*100).replace('a', 'œ€é')) 7.560455708007855 Maybe, not so demonstative. It shows at least, we are far away from the 10-30% annouced. 7.56 / 5 1.512 5 / (7.56 - 5) * 100 195.312503 jmf -- http://mail.python.org/mailman/listinfo/python-list
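A single interactive timeit.timeit() call is quite noisy; if the aim is to compare 3.2 and 3.3 on this exact operation, running something like the following under both interpreters (same machine, several repeats, take the minimum) gives steadier numbers. The string literals are the ones from the post, spelled as escapes so the source encoding cannot interfere:

    import sys
    import timeit

    stmt = "('a\u0153\u20ac' * 100).replace('a', '\u0153\u20ac\u00e9')"

    best = min(timeit.repeat(stmt, repeat=5, number=100000))
    print(sys.version.split()[0], best)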
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 4:54 AM, wxjmfa...@gmail.com wrote: About the exemples contested by Steven: eg: timeit.timeit(('ab…' * 10).replace('…', 'œ…')) And it is good enough to show the problem. Period. Repeating a false claim over and over does not make it true. Two people on pydev claim that 3.3 is *faster* on their systems (one unspecified, one OSX10.8). -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?
I have an example: def pairwiseScore(seqA, seqB): prev = -1 score = 0 length = len(seqA) similarity = [] relative_similarity = [] for x in xrange(length): if seqA[x] == seqB[x]: if (x = 1) and (seqA[x - 1] == seqB[x - 1]): score += 3 similarity.append(x) else: score += 1 similarity.append(x) else: score -= 1 for x in similarity: relative_similarity.append(x - prev) prev = x return ''.join((seqA, '\n', ''.join(['|'.rjust(x) for x in relative_similarity]), '\n', seqB, '\n', 'Score: ', str(score))) print pairwiseScore(ATTCGT, ATCTAT), '\n', '\n', pairwiseScore(GATAAATCTGGTCT, CATTCATCATGCAA), '\n', '\n', pairwiseScore('AGCG', 'ATCG'), '\n', '\n', pairwiseScore('ATCG', 'ATCG') which returns: ATTCGT || | ATCTAT Score: 2 GATAAATCTGGTCT || ||| | CATTCATCATGCAA Score: 4 AGCG | || ATCG Score: 4 ATCG ATCG Score: 10 But i created this with some help from one person. Earlier, this code was devoided of these few lines: prev = -1 relative_similarity = [] for x in similarity: relative_similarity.append(x - prev) prev = x The method looked liek this: def pairwiseScore(seqA, seqB): score = 0 length = len(seqA) similarity = [] for x in xrange(length): if seqA[x] == seqB[x]: if (x = 1) and (seqA[x - 1] == seqB[x - 1]): score += 3 similarity.append(x) else: score += 1 similarity.append(x) else: score -= 1 return ''.join((seqA, '\n', ''.join(['|'.rjust(x) for x in similarity]), '\n', seqB, '\n', 'Score: ', str(score))) and produced this output: ATTCGT ||| ATCTAT Score: 2 GATAAATCTGGTCT | || | | | CATTCATCATGCAA Score: 4 AGCG | | | ATCG Score: 4 ATCG || | | ATCG Score: 10 So I have guessed, that characters processed by .rjust() function, are placed in output, relative to previous ones - NOT to first, most to left placed, character. Why it works like that? What builtn-in function can format output, to make every character be placed as i need - relative to the first character, placed most to left side of screen. Cheers -- http://mail.python.org/mailman/listinfo/python-list
Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?
Here's first code - http://codepad.org/RcKTTiYa And here's second - http://codepad.org/zwEQKKeV -- http://mail.python.org/mailman/listinfo/python-list
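The listing in the question appears to have lost some characters along the way (the quotes around the sequence literals and the >= in the index test), so here is a runnable reconstruction under that assumption, in the same Python 2 style as the original; only one of the example calls is repeated:

    def pairwiseScore(seqA, seqB):
        prev = -1
        score = 0
        length = len(seqA)
        similarity = []
        relative_similarity = []
        for x in xrange(length):
            if seqA[x] == seqB[x]:
                if (x >= 1) and (seqA[x - 1] == seqB[x - 1]):
                    score += 3
                    similarity.append(x)
                else:
                    score += 1
                    similarity.append(x)
            else:
                score -= 1
        # Convert absolute match positions into gaps between successive matches,
        # because each '|'.rjust(width) only pads relative to the previous bar.
        for x in similarity:
            relative_similarity.append(x - prev)
            prev = x
        return ''.join((seqA, '\n',
                        ''.join(['|'.rjust(x) for x in relative_similarity]),
                        '\n', seqB, '\n', 'Score: ', str(score)))

    print pairwiseScore('ATTCGT', 'ATCTAT')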
Re: New internal string format in 3.3
On Aug 19, 2012 5:22 PM, wxjmfa...@gmail.com wrote Py 3.2.3 timeit.timeit(('aœ€'*100).replace('a', 'œ€é')) 4.99396356635981 Py 3.3b2 timeit.timeit(('aœ€'*100).replace('a', 'œ€é')) 7.560455708007855 Maybe, not so demonstative. It shows at least, we are far away from the 10-30% annouced. 7.56 / 5 1.512 5 / (7.56 - 5) * 100 195.312503 Maybe the problem is that your understanding of a percentage differs from that of others. I make that a 51% increase. I don't really understand what your 195 figure is demonstrating. Oscar. -- http://mail.python.org/mailman/listinfo/python-list
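Spelled out with the numbers from the post, the two figures being talked past each other are the ratio and the percentage increase:

    py32 = 4.99396356635981      # timeit result on 3.2.3
    py33 = 7.560455708007855     # timeit result on 3.3b2

    ratio = py33 / py32                  # about 1.51: the 3.3 run takes roughly 1.5 times as long
    pct_increase = (ratio - 1) * 100     # about 51%, the figure quoted above
    print(round(ratio, 3), round(pct_increase, 1))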
Re: How do I display unicode value stored in a string variable using ord()
Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 39.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 51.8 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 52 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 50.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 51.6 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 38.3 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 50.3 usec per loop python32\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 24.5 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 24.7 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 24.8 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 24 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 24.1 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 24.4 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? -- http://mail.python.org/mailman/listinfo/python-list
Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?
On 08/19/2012 12:25 PM, crispy wrote: SNIP So I have guessed, that characters processed by .rjust() function, are placed in output, relative to previous ones - NOT to first, most to left placed, character. rjust() does not print to the console, it just produces a string. So if you want to know how it works, you need to either read about it, or experiment with it. Try help(.rjust) to see a simple description of it. (If you're not familiar with the interactive interpreter's help() function, you owe it to yourself to learn it). Playing with it: print abcd.rjust(8, -) producesabcd for i in range(5): print a.rjust(i, -) produces: a a -a --a ---a In each case, the number of characters produced is no larger than i. No consideration is made to other strings outside of the literal passed into the method. Why it works like that? In your code, you have the rjust() method inside a loop, inside a join, inside a print. it makes a nice, impressive single line, but clearly you don't completely understand what the pieces are, nor how they work together. Since the join is combining (concatenating) strings that are each being produced by rjust(), it's the join() that's making this look relative to you. What builtn-in function can format output, to make every character be placed as i need - relative to the first character, placed most to left side of screen. If you want to randomly place characters on the screen, you either want a curses-like package, or a gui. i suspect that's not at all what you want. if you want to randomly change characters in a pre-existing string, which will then be printed to the console, then I could suggest an approach (untested) res = [ ] * length for column in similarity: res[column] = | res = .join(res) -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
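Dave's snippets seem to have lost their quote characters in transit; restored, and with his untested suggestion filled in with hypothetical match positions, they look like this (Python 2 print statements, as in the rest of the thread):

    print "abcd".rjust(8, "-")        # ----abcd : pad on the left out to width 8

    for i in range(5):
        print "a".rjust(i, "-")       # output is never wider than max(i, len("a"))

    # Building the bar line by absolute column instead of by relative gaps:
    length = 6
    similarity = [0, 1, 4]            # hypothetical match positions
    res = [" "] * length
    for column in similarity:
        res[column] = "|"
    res = "".join(res)
    print res                         # '||  | '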
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 4:04 AM, Paul Rubin wrote: Meanwhile, an example of the 393 approach failing: I am completely baffled by this, as this example is one where the 393 approach potentially wins. I was involved in a project that dealt with terabytes of OCR data of mostly English text. So the chars were mostly ascii, 3.3 stores ascii pages 1 byte/char rather than 2 or 4. but there would be occasional non-ascii chars including supplementary plane characters, either because of special symbols that were really in the text, or the typical OCR confusion emitting those symbols due to printing imprecision. I doubt that there are really any non-bmp chars. As Steven said, reject such false identifications. That's a natural for UTF-8 3.3 would convert to utf-8 for storage on disk. but the PEP-393 approach would bloat up the memory requirements by a factor of 4. 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always? py s = chr(0x + 1) py a, b = s That looks like Python 3.2 is buggy and that sample should just throw an error. s is a one-character string and should not be unpackable. That looks like a 3.2- narrow build. Such which treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but compromise design that goes back about a decade to when unicode was added to Python. At that time, there were only a few defined non-BMP chars and their usage was extremely rare. There are now more extended chars than BMP chars and usage will become more common even in English text. Pre 3.3, there are really 2 sub-versions of every Python version: a narrow build and a wide build version, with not very well documented different behaviors for any string with extended chars. That is and would have become an increasing problem as extended chars are increasingly used. If you want to say that what was once a practical compromise has become a design bug, I would not argue. In any case, 3.3 fixes that split and returns Python to being one cross-platform language. I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it? Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already. Ropes of UTF-8 segments seems like the most obvious approach and I wonder if it was considered. By that I mean pick some implementation constant k (say k=128) and represent the string as a UTF-8 encoded byte array, accompanied by a vector n//k pointers into the byte array, where n is the number of codepoints in the string. Then you can reach any offset analogously to reading a random byte on a disk, by seeking to the appropriate block, and then reading the block and getting the char you want within it. Random access is then O(1) though the constant is higher than it would be with fixed width encoding. I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. For strings less than k, access is O(len). I believe slicing would require re-indexing. As 393 was near adoption, I proposed a scheme using utf-16 (narrow builds) with a supplementary index of extended chars when there are any. 
That makes access O(1) if there are none and O(log(k)), where k is the number of extended chars in the string, if there are some. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
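The rope-of-UTF-8 idea quoted above (a byte buffer plus a saved offset every k code points) can be sketched in a few lines. This is only an illustration of the O(k) indexing cost under discussion, with k chosen arbitrarily; it is not how CPython stores strings:

    class Utf8Index(object):
        """UTF-8 buffer with the byte offset of every k-th code point recorded."""

        def __init__(self, text, k=128):
            self.k = k
            self.data = text.encode('utf-8')
            self.offsets = []                    # byte offset of code point i * k
            offset = 0
            for i, ch in enumerate(text):
                if i % k == 0:
                    self.offsets.append(offset)
                offset += len(ch.encode('utf-8'))
            self.length = len(text)

        def __getitem__(self, i):
            # Jump to the nearest recorded block, then walk at most k - 1 characters.
            block, extra = divmod(i, self.k)
            offset = self.offsets[block]
            chunk = self.data[offset:offset + 4 * (extra + 1)]   # enough bytes for extra + 1 chars
            return chunk.decode('utf-8', 'ignore')[extra]

    s = Utf8Index('abc\u20acdef' * 100)
    print(s[3])    # the euro sign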
Re: How do I display unicode value stored in a string variable using ord()
Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit : Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 39.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 51.8 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 52 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 50.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 51.6 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 38.3 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 50.3 usec per loop python32\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 24.5 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 24.7 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 24.8 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 24 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 24.1 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 24.4 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? I use win7 pro 32bits in intel? Thanks for reporting these numbers. To be clear: I'm not complaining, but the fact that there is a slow down is a clear indication (in my mind), there is a point somewhere. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 8/19/2012 10:09 AM, wxjmfa...@gmail.com wrote: I can not give you more numbers than those I gave. As a end user, I noticed and experimented my random tests are always slower in Py3.3 than in Py3.2 on my Windows platform. And I gave other examples where 3.3 is *faster* on my Windows, which you have thus far not even acknowledged, let alone tried. It is up to you, the core developers to give an explanation about this behaviour. System variation, unimportance of sub-microsecond variations, and attention to more important issues. Other developers say 3.3 is generally faster on their systems (OSX 10.8, and unspecified). To talk about speed sensibly, one must run the full stringbench.py benchmark and real applications on multiple Windows, *nix, and Mac systems. Python is not optimized for your particular current computer. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Terry Reedy tjre...@udel.edu writes: Meanwhile, an example of the 393 approach failing: I am completely baffled by this, as this example is one where the 393 approach potentially wins. What? The 393 approach is supposed to avoid memory bloat and that does the opposite. I was involved in a project that dealt with terabytes of OCR data of mostly English text. So the chars were mostly ascii, 3.3 stores ascii pages 1 byte/char rather than 2 or 4. But they are not ascii pages, they are (as stated) MOSTLY ascii. E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses a much more memory-expensive encoding than UTF-8. I doubt that there are really any non-bmp chars. You may be right about this. I thought about it some more after posting and I'm not certain that there were supplemental characters. As Steven said, reject such false identifications. Reject them how? That's a natural for UTF-8 3.3 would convert to utf-8 for storage on disk. They are already in utf-8 on disk though that doesn't matter since they are also compressed. but the PEP-393 approach would bloat up the memory requirements by a factor of 4. 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally better than always? The bloat is in comparison with utf-8, in that example. That looks like a 3.2- narrow build. Such which treat unicode strings as sequences of code units rather than sequences of codepoints. Not an implementation bug, but compromise design that goes back about a decade to when unicode was added to Python. I thought the whole point of Python 3's disruptive incompatibility with Python 2 was to clean up past mistakes and compromises, of which unicode headaches was near the top of the list. So I'm surprised they seem to repeated a mistake there. I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. If k is constant then O(k) is the same as O(1). That is how O notation works. I wouldn't believe the 100x figure without seeing it measured in real-world applications. -- http://mail.python.org/mailman/listinfo/python-list
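The mostly-ASCII case being argued about can be put in numbers on a 3.3 build by comparing in-memory size against the UTF-8 length of the same text; the absolute figures vary by build, but the pattern (UTF-8 grows by a couple of bytes, the internal form roughly doubles or quadruples once a wide character appears) is what both sides are describing:

    import sys

    page = 'x' * 3000                       # stand-in for a page of ASCII OCR text
    page_bmp = page + '\u2026'              # one ellipsis forces two bytes per character
    page_astral = page + '\U0001d11e'       # one non-BMP character forces four bytes per character

    for s in (page, page_bmp, page_astral):
        print(sys.getsizeof(s), len(s.encode('utf-8')))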
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393: There is some additional benefit for Latin-1 users, but this has nothing to do with Python. If Python is going to have the option of a 1-byte representation (and as long as we have the flexible representation, I can see no reason not to), The PEP explicitly states that it only uses a 1-byte format for ASCII strings, not Latin-1: I think you misunderstand the PEP then, because that is empirically false. Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC v.1600 64 bit (AMD64)] on win32 Type help, copyright, credits or license for more information. import sys sys.getsizeof(bytes(range(256)).decode('latin1')) 329 The constructed string contains all 256 Latin-1 characters, so if Latin-1 strings must be stored in the 2-byte format, then the size should be at least 512 bytes. It is not, so I think it must be using the 1-byte encoding. ASCII-only Unicode strings will again use only one byte per character This says nothing one way or the other about non-ASCII Latin-1 strings. If the maximum character is less than 128, they use the PyASCIIObject structure Note that this only describes the structure of compact string objects, which I have to admit I do not fully understand from the PEP. The wording suggests that it only uses the PyASCIIObject structure, not the derived structures. It then says that for compact ASCII strings the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. But these fields are part of the PyCompactUnicodeObject structure, not the base PyASCIIObject structure, so they would not exist if only PyASCIIObject were used. It would also imply that compact non-ASCII strings are stored internally as UTF-8, which would be surprising. and: The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). This says that if the data are ASCII, then the 1-byte representation and the utf8 pointer will share the same memory. It does not imply that the 1-byte representation is not used for Latin-1, only that it cannot also share memory with the utf8 pointer. -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
Just for the story: five minutes after I closed my interactive interpreter windows, the day I tested this stuff, I thought: too bad I did not note the extremely bad cases I found; I'm pretty sure this problem will arrive on the table. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 8/19/2012 8:59 AM, wxjmfa...@gmail.com wrote: In August 2012, after 20 years of development, Python is not able to display a piece of text correctly on a Windows console (eg cp65001). cp65001 is known to not work right. It has been very frustrating. Bug Microsoft about it, and indeed their whole policy of still dividing the world into code page regions, even in their next version, instead of moving toward unicode and utf-8, at least as an option. I downloaded the go language, zero experience, I did not succeed to display incorrecly a piece of text. (This is by the way *the* reason why I tested it). Where the problems are coming from, I have no idea. If go can display all unicode chars on a Windows console, perhaps you can do some research and find out how they do so. Then we could consider copying it. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
wrote in message news:5dfd1779-9442-4858-9161-8f1a06d56...@googlegroups.com... Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit : Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 39.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 51.8 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 52 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 50.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 51.6 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 38.3 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 50.3 usec per loop python32\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 24.5 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 24.7 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 24.8 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 24 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 24.1 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 24.4 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? I use win7 pro 32bits in intel? Thanks for reporting these numbers. To be clear: I'm not complaining, but the fact that there is a slow down is a clear indication (in my mind), there is a point somewhere. I may be reading your input wrongly, but it seems to me that you are not only reporting a slowdown but you are also suggesting that this slowdown is the result of bad design decisions by the Python development team. I don't want to get involved in the latter part of your argument because I am convinced that the Python team are doing their very best to find a good compromise between the various design constraints that they face in meeting these needs. Nevertheless, the post that I responded to contained the suggestion that slowdowns above 100% (which I took as a factor of 2) would be worth reporting as a possible bug. So I thought that it was worth asking about this as I may have misunderstood the level of slowdown that is worth reporting. There is also a potential problem in timings on laptops with turbo-boost (as I have), although the times look fairly consistent. -- http://mail.python.org/mailman/listinfo/python-list
Re: New image and color management library for Python 2+3
On 14.08.2012 21:22, Christian Heimes wrote: Hello fellow Pythonistas, Performance === smc.freeimage with libjpeg-turbo read JPEGs about three to six times faster than PIL and writes JPEGs more than five times faster. [] Python 2.7.3 read / write cycles: 300 test image: 1210x1778 24bpp JPEG (pon.jpg) platform: Ubuntu 12.04 X86_64 hardware: Intel Xeon hexacore W3680@3.33GHz with 24 GB RAM smc.freeimage, FreeImage 3.15.3 standard - read JPEG 12.857 sec - read JPEG 6.629 sec (resaved) - write JPEG 21.817 sec smc.freeimage, FreeImage 3.15.3 with jpeg turbo - read JPEG 9.297 sec - read JPEG 3.909 sec (resaved) - write JPEG 5.857 sec - read LZW TIFF 17.947 sec - read biton G4 TIFF 2.068 sec - resize 3.850 sec (box) - resize 5.022 sec (bilinear) - resize 7.942 sec (bspline) - resize 7.222 sec (bicubic) - resize 7.941 sec (catmull rom spline) - resize 10.232 sec (lanczos3) - tiff numpy.asarray() with bytescale() 0.006 sec - tiff load + numpy.asarray() with bytescale() 18.043 sec PIL 1.1.7 - read JPEG 30.389 sec - read JPEG 23.118 sec (resaved) - write JPEG 34.405 sec - read LZW TIFF 21.596 sec - read biton G4 TIFF: decoder group4 not available - resize 0.032 sec (nearest) - resize 1.074 sec (bilinear) - resize 2.924 sec (bicubic) - resize 8.056 sec (antialias) - tiff scipy fromimage() with bytescale() 1.165 sec - tiff scipy imread() with bytescale() 22.939 sec Christian Hello Christian, I'm sorry for getting out of your initial question/request, but did you try out ImageMagick before making use of FreeImage - do you even perhaps can deliver a comparison between your project and ImageMagick (if regular Python is used)? I ask cause: Im in the process of creating a web-app which also requires image processing and just switching from PIL (because it is unfortunately not that quick as it should be) to ImageMagick and the speeds are much better compared to it, but I didn't take measurements of that. Can you perhaps test your solution with ImageMagick (as it is used widely) it would be interesting so. :) But no offence by that and respect for you work so! Jan -- http://mail.python.org/mailman/listinfo/python-list
Re: Branch and Bound Algorithm / Module for Python?
On 8/19/2012 5:04 AM, Rebekka-Marie wrote: Hello everybody, I would like to solve a Mixed Integer Optimization Problem with the Branch-And-Bound Algorithm. I designed my Minimizing function and the constraints. I tested them in a small program in AIMMS. So I already know that they are solvable. Now I want to solve them using Python. Is there a module / methods that I can download or a ready-made program text that you know about, where I can put my constraints and minimization function in? Search 'Python constraint solver' and you should find at least two programs. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
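One concrete option such a search turns up is PuLP, which lets you state a mixed-integer model in Python and hands it to a branch-and-bound solver (CBC by default). A minimal sketch with made-up data, not the poster's actual objective or constraints:

    from pulp import LpMinimize, LpProblem, LpStatus, LpVariable, value

    prob = LpProblem("small_mip", LpMinimize)

    x = LpVariable("x", lowBound=0, cat="Integer")   # integer decision variable
    y = LpVariable("y", lowBound=0)                  # continuous decision variable

    prob += 2 * x + 3 * y                            # objective: minimize 2x + 3y
    prob += x + y >= 4                               # constraints
    prob += x - y <= 2

    prob.solve()                                     # branch and bound via the default CBC backend
    print(LpStatus[prob.status], value(x), value(y))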
Re: How do I display unicode value stored in a string variable using ord()
On 08/19/2012 01:03 PM, Blind Anagram wrote: Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2: python33\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 39.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 51.8 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 52 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 50.3 usec per loop python33\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 51.6 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 38.3 usec per loop python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 50.3 usec per loop python32\python -m timeit ('abc' * 1000).replace('c', 'de') 1 loops, best of 3: 24.5 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '……') 1 loops, best of 3: 24.7 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'x…') 1 loops, best of 3: 24.8 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…') 1 loops, best of 3: 24 usec per loop python32\python -m timeit ('ab…' * 1000).replace('…', '€…') 1 loops, best of 3: 24.1 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç') 1 loops, best of 3: 24.4 usec per loop python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?') 1 loops, best of 3: 24.3 usec per loop This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. Using your measurement numbers, I get an average of 1.95, not 2.3 -- DaveA -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On 19/08/2012 18:51, wxjmfa...@gmail.com wrote: Just for the story: five minutes after I closed my interactive interpreter windows, the day I tested this stuff, I thought: too bad I did not note the extremely bad cases I found; I'm pretty sure this problem will arrive on the table. jmf How convenient. -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Dave Angel wrote in message news:mailman.3519.1345399574.4697.python-l...@python.org... [...] This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. Using your measurement numbers, I get an average of 1.95, not 2.3 Yes - you are right - my apologies. But it is close enough to 2 to still be worth asking. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit : But they are not ascii pages, they are (as stated) MOSTLY ascii. E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses a much more memory-expensive encoding than UTF-8. Imagine a US banking application, everything in ascii, except ... the € currency symbol, code point 0x20ac. Well, it seems some software producers know what they are doing. '€'.encode('cp1252') b'\x80' '€'.encode('mac-roman') b'\xdb' '€'.encode('iso-8859-1') Traceback (most recent call last): File "eta last command", line 1, in <module> UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256) jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Ian Kelly ian.g.ke...@gmail.com writes: sys.getsizeof(bytes(range(256)).decode('latin1')) 329 Please try: print (type(bytes(range(256)).decode('latin1'))) to make sure that what comes back is actually a unicode string rather than a byte string. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin no.email@nospam.invalid wrote: Ian Kelly ian.g.ke...@gmail.com writes: sys.getsizeof(bytes(range(256)).decode('latin1')) 329 Please try: print (type(bytes(range(256)).decode('latin1'))) to make sure that what comes back is actually a unicode string rather than a byte string. As I understand it, the decode method never returns a byte string in Python 3, but if you insist: print (type(bytes(range(256)).decode('latin1'))) <class 'str'> -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly ian.g.ke...@gmail.com wrote: Note that this only describes the structure of compact string objects, which I have to admit I do not fully understand from the PEP. The wording suggests that it only uses the PyASCIIObject structure, not the derived structures. It then says that for compact ASCII strings the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. But these fields are part of the PyCompactUnicodeObject structure, not the base PyASCIIObject structure, so they would not exist if only PyASCIIObject were used. It would also imply that compact non-ASCII strings are stored internally as UTF-8, which would be surprising. Oh, now I get it. I had missed the part where it says character data immediately follow the base structure. And the bit about the UTF-8 data, the UTF-8 length and the wstr length are not describing the contents of those fields, but rather where the data can be alternatively found since the fields don't exist. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 19/08/2012 19:11, wxjmfa...@gmail.com wrote: Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit : But they are not ascii pages, they are (as stated) MOSTLY ascii. E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses a much more memory-expensive encoding than UTF-8. Imagine an us banking application, everything in ascii, except ... the € currency symbole, code point 0x20ac. Well, it seems some software producers know what they are doing. '€'.encode('cp1252') b'\x80' '€'.encode('mac-roman') b'\xdb' '€'.encode('iso-8859-1') Traceback (most recent call last): File eta last command, line 1, in module UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256) jmf Well that's it then, the world stock markets will all collapse tonight when the news leaks out that those stupid Americans haven't yet realised that much of Europe (with at least one very noticeable and sensible exception :) uses Euros. I'd better sell all my stock holdings fast. -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
Ian Kelly ian.g.ke...@gmail.com writes: print (type(bytes(range(256)).decode('latin1'))) <class 'str'> Thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?
W dniu niedziela, 19 sierpnia 2012 19:31:30 UTC+2 użytkownik Dave Angel napisał: On 08/19/2012 12:25 PM, crispy wrote: SNIP So I have guessed, that characters processed by .rjust() function, are placed in output, relative to previous ones - NOT to first, most to left placed, character. rjust() does not print to the console, it just produces a string. So if you want to know how it works, you need to either read about it, or experiment with it. Try help(.rjust) to see a simple description of it. (If you're not familiar with the interactive interpreter's help() function, you owe it to yourself to learn it). Playing with it: print abcd.rjust(8, -) producesabcd for i in range(5): print a.rjust(i, -) produces: a a -a --a ---a In each case, the number of characters produced is no larger than i. No consideration is made to other strings outside of the literal passed into the method. Why it works like that? In your code, you have the rjust() method inside a loop, inside a join, inside a print. it makes a nice, impressive single line, but clearly you don't completely understand what the pieces are, nor how they work together. Since the join is combining (concatenating) strings that are each being produced by rjust(), it's the join() that's making this look relative to you. What builtn-in function can format output, to make every character be placed as i need - relative to the first character, placed most to left side of screen. If you want to randomly place characters on the screen, you either want a curses-like package, or a gui. i suspect that's not at all what you want. if you want to randomly change characters in a pre-existing string, which will then be printed to the console, then I could suggest an approach (untested) res = [ ] * length for column in similarity: res[column] = | res = .join(res) -- DaveA Thanks, i've finally came to solution. Here it is - http://codepad.org/Q70eGkO8 def pairwiseScore(seqA, seqB): score = 0 bars = [str(' ') for x in seqA] #create a list filled with number of spaces equal to length of seqA string. It could be also seqB, because both are meant to have same length length = len(seqA) similarity = [] for x in xrange(length): if seqA[x] == seqB[x]: #check if for every index 'x', corresponding character is same in both seqA and seqB strings if (x = 1) and (seqA[x - 1] == seqB[x - 1]): #if 'x' is greater than or equal to 1 and characters under the previous index, were same in both seqA and seqB strings, do.. score += 3 similarity.append(x) else: score += 1 similarity.append(x) else: score -= 1 for x in similarity: bars[x] = '|' #for every index 'x' in 'bars' list, replace space with '|' (pipe/vertical bar) character return ''.join((seqA, '\n', ''.join(bars), '\n', seqB, '\n', 'Score: ', str(score))) print pairwiseScore(ATTCGT, ATCTAT), '\n', '\n', pairwiseScore(GATAAATCTGGTCT, CATTCATCATGCAA), '\n', '\n', pairwiseScore('AGCG', 'ATCG'), '\n', '\n', pairwiseScore('ATCG', 'ATCG') -- http://mail.python.org/mailman/listinfo/python-list
Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote: Terry Reedy tjre...@udel.edu writes: I would call it O(k), where k is a selectable constant. Slowing access by a factor of 100 is hardly acceptable to me. If k is constant then O(k) is the same as O(1). That is how O notation works. You might as well say that if N is constant, O(N**2) is constant too and just like magic you have now made Bubble Sort a constant-time sort function! That's not how it works. Of course *if* k is constant, O(k) is constant too, but k is not constant. In context we are talking about string indexing and slicing. There is no value of k, say, k = 2, for which you can say People will sometimes ask for string[2] but never ask for string[3]. That is absurd. Since k can vary from 0 to N-1, we can say that the average string index lookup is k = (N-1)//2 which clearly depends on N. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
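For readers who would rather test the position-dependence question empirically than argue it, a small, hedged sketch with timeit follows (the string size and index positions are arbitrary; absolute numbers depend on the interpreter, build and platform):

import timeit

setup = "s = 'x' * 10000000"
for expr in ("s[2]", "s[5000000]", "s[9999999]"):
    # time a lookup near the start, the middle and the end of the string
    best = min(timeit.repeat(expr, setup=setup, number=1000000, repeat=5))
    print(expr, "->", best, "seconds per million lookups")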
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote: On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: [...] The PEP explicitly states that it only uses a 1-byte format for ASCII strings, not Latin-1: I think you misunderstand the PEP then, because that is empirically false. Yes I did misunderstand. Thank you for the clarification. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Sun, 19 Aug 2012 18:03:34 +0100, Blind Anagram wrote: Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. [...] This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta? Possibly, if it is consistent and non-trivial. Serious performance regressions are bugs. Trivial ones, not so much. Thanks to Terry Reedy, who has already asked the Python Devs about this issue, they have made it clear that they aren't hugely interested in micro-benchmarks in isolation. If you want the bug report to be taken seriously, you would need to run the full Python string benchmark. The results of that would be interesting to see. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Why doesn't Python remember the initial directory?
As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Does anyone know why this is? Is there a PEP stating the rationale for it? Thanks! -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 1:03 PM, Blind Anagram wrote: Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2:

python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 50.3 usec per loop
python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 24.3 usec per loop

This is one test repeated 7 times with essentially irrelevant variations. The difference is less on my system (50%). Others report seeing 3.3 as faster. When I asked on pydev, the answer was don't bother making a tracker issue unless I was personally interested in investigating why search is relatively slow in 3.3 on Windows. Any change would have to not slow other operations or severely impact search on other systems. I suggest the same answer to you.

If you seriously want to compare old and new unicode, go to http://hg.python.org/cpython/file/tip/Tools/stringbench/stringbench.py and click raw to download. Run on 3.2 and 3.3, ignoring the bytes times. Here is a version of the first comparison from stringbench:

print(timeit('''('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID OF PYTHON'* 10).lower()'''))

Results are 5.6 for 3.2 and .8 for 3.3. WOW! 3.3 is 7 times faster! OK, not fair. I cherry picked. The 7 times speedup in 3.3 likely is at least partly independent of the 393 unicode change. The same test in stringbench for bytes is twice as fast in 3.3 as 3.2, but only 2x, not 7x. In fact, it may have been the bytes/unicode comparison in 3.2 that suggested that unicode case conversion of ascii chrs might be made faster.

The sum of the 3.3 unicode times is 109 versus 110 for 3.3 bytes and 125 for 3.2 unicode. This unweighted sum is not really fair since the raw times vary by a factor of at least 100. But it does suggest that anyone claiming that 3.3 unicode is overall 'slower' than 3.2 unicode has some work to do.

There is also this. On my machine, the lowest bytes-time/unicode-time for 3.3 is .71. This suggests that there is not a lot of fluff left in the unicode code, and that not much is lost by the bytes to unicode switch for strings. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
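If downloading all of stringbench is more than you want, a rough sketch of the same idea is a script of timeit calls covering several different operations, run unchanged under both interpreters (the operation mix below is illustrative only, not the stringbench suite itself):

import timeit

cases = {
    "replace ascii": "('abc' * 1000).replace('c', 'de')",
    "replace euro":  "('ab\u20ac' * 1000).replace('\u20ac', 'x\u20ac')",
    "lower ascii":   "('NOW IS THE TIME FOR ALL GOOD PEOPLE ' * 10).lower()",
    "find ascii":    "('abc' * 1000).find('cab')",
}

for name, stmt in sorted(cases.items()):
    # keep the best of several repeats to reduce noise from other processes
    best = min(timeit.repeat(stmt, number=10000, repeat=5))
    print("%-14s %.1f usec per call" % (name, best / 10000 * 1e6))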
Re: Why doesn't Python remember the initial directory?
On Sunday, 19 August 2012 at 22:42:16 UTC+2, kj wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Does anyone know why this is? Is there a PEP stating the rationale for it? Thanks!

You can obtain the working directory with os.getcwd().

giacomo@jack-laptop:~$ echo 'import os; print os.getcwd()' > testing-dir.py
giacomo@jack-laptop:~$ python testing-dir.py
/home/giacomo
giacomo@jack-laptop:~$ cd Documenti
giacomo@jack-laptop:~/Documenti$ python ../testing-dir.py
/home/giacomo/Documenti
giacomo@jack-laptop:~/Documenti$

Obviously using os.chdir() will change the working directory, and os.getcwd() will then no longer return the start-up working directory, but if you need the start-up working directory you can get it at start-up and save it in some constant. -- http://mail.python.org/mailman/listinfo/python-list
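A minimal sketch of that "save it in some constant" suggestion (the module name startupdir is just an example): capture the directory once, in a module imported before anything can call os.chdir().

# startupdir.py -- import this before any code that might change directory
import os

INITIAL_DIR = os.getcwd()   # captured once, at first import

# later, anywhere in the program:
#     import startupdir
#     os.chdir('/tmp')                  # wander off somewhere else
#     print(startupdir.INITIAL_DIR)     # still the start-up directory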
Re: Why doesn't Python remember the initial directory?
In article k0rj38$2gc$1...@reader1.panix.com, kj no.em...@please.post wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Why would you expect that it would? What would it (or you) do with this information? More to the point, doing a chdir() is not something any library code would do (at least not that I'm aware of), so if the directory changed, it's because some application code did it. In which case, you could have just stored the working directory yourself. -- http://mail.python.org/mailman/listinfo/python-list
Re: Why doesn't Python remember the initial directory?
On 19/08/2012 21:42, kj wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Does anyone know why this is? Is there a PEP stating the rationale for it? Thanks! Why would you have a Python Enhancement Proposal to state the rationale for this? -- Cheers. Mark Lawrence. -- http://mail.python.org/mailman/listinfo/python-list
Re: Why doesn't Python remember the initial directory?
On 2012-08-19 22:42, kj wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Does anyone know why this is? Is there a PEP stating the rationale for it? Thanks! When you start the program, you have a current directory. When you change it, then it is changed. How do you want Python to remember a directory? For example, you can put it into a variable, and use it later. Can you please show us some example code that demonstrates your actual problem? -- http://mail.python.org/mailman/listinfo/python-list
Re: New internal string format in 3.3
On Mon, Aug 20, 2012 at 4:09 AM, Mark Lawrence breamore...@yahoo.co.uk wrote: On 19/08/2012 18:51, wxjmfa...@gmail.com wrote: Just for the story. Five minutes after a closed my interactive interpreters windows, the day I tested this stuff. I though: Too bad I did not noted the extremely bad cases I found, I'm pretty sure, this problem will arrive on the table. How convenient. Not really. Even if he HAD copied-and-pasted those worst-cases, it'd prove nothing. Maybe his system just chose to glitch right then. It's always possible to find statistical outliers that take way way longer than everything else. Watch this. Python 3.2 on Windows is optimized for adding 1 to numbers. C:\Documents and Settings\M\python32\python -m timeit -r 1 a=1+1 1000 loops, best of 1: 0.0654 usec per loop C:\Documents and Settings\M\python32\python -m timeit -r 1 a=1+1 1000 loops, best of 1: 0.0654 usec per loop C:\Documents and Settings\M\python32\python -m timeit -r 1 a=1+1 1000 loops, best of 1: 0.0654 usec per loop C:\Documents and Settings\M\python32\python -m timeit -r 1 a=1+2 1000 loops, best of 1: 0.0711 usec per loop Now, as long as I don't tell you that during the last test I had quite a few other processes running, including VLC playing a movie and two Python processes running while True: pass, this will look like a significant performance difference. So now, I'm justified in complaining about how suboptimal Python is when adding 2 to a number, which I can assure you is a VERY common case. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
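A hedged sketch of the usual defence against exactly this kind of interference: repeat the measurement several times and keep only the minimum, which is the run least disturbed by other processes (the statement strings are arbitrary examples).

import timeit

for stmt in ("a = 1 + 1", "a = 1 + 2"):
    # the minimum over several repeats filters out background load
    best = min(timeit.repeat(stmt, number=1000000, repeat=7))
    print(stmt, "->", best, "seconds per million loops")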
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote: Well, it seems some software producers know what they are doing. '€'.encode('cp1252') b'\x80' '€'.encode('mac-roman') b'\xdb' '€'.encode('iso-8859-1') Traceback (most recent call last): File eta last command, line 1, in module UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256) Yes, Python lets you choose your byte encoding from those and a hundred others. I believe all the codecs are now tested in both directions. It was not an easy task. As to the examples: Latin-1 dates to 1985 and before and the 1988 version was published as a standard in 1992. https://en.wikipedia.org/wiki/Latin-1 The name euro was officially adopted on 16 December 1995. https://en.wikipedia.org/wiki/Euro No wonder Latin-1 does not contain the Euro sign. International standards organizations standards are relatively fixed. (The unicode consortium will not even correct misspelled character names.) Instead, new standards with a new number are adopted. For better or worse, private mappings are more flexible. In its Mac mapping Apple replaced the generic currency sign ¤ with the euro sign €. (See Latin-1 reference.) Great if you use Euros, not so great if you were using the previous sign for something else. Microsoft changed an unneeded code to the Euro for Windows cp-1252. https://en.wikipedia.org/wiki/Windows-1252 It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by smart quotes in Microsoft software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling. This is now standard behavior in the draft HTML 5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.[1] Lots of fun. Too bad Microsoft won't push utf-8 so we can all communicate text with much less chance of ambiguity. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
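A small sketch of the mislabelling problem described above (Python 3; the sample text is arbitrary): bytes produced under Windows-1252 decode without error as ISO-8859-1, but the euro sign and smart quotes silently become invisible C1 control characters instead of raising an exception.

# Text as a Windows program would typically encode it.
data = 'price: 5\u20ac, \u201csmart quotes\u201d'.encode('cp1252')

print(data.decode('cp1252'))        # correct round trip
wrong = data.decode('iso-8859-1')   # no error raised...
print(repr(wrong))                  # ...but \x80, \x93, \x94 appear where € and the quotes were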
Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS
On Saturday, 18 August 2012 00:42:00 UTC+5:30, Ian wrote: On Fri, Aug 17, 2012 at 6:46 AM, coldfire amangill.coldf...@gmail.com wrote: I would like to know that where can a python script be stored on-line from were it keep running and can be called any time when required using internet. I have used mechanize module which creates a webbroswer instance to open a website and extract data and email me. I have tried Python anywhere but they dont support opening of anonymous websites. According to their FAQ they don't support this for *free* accounts. You could just open a paid account (the cheapest option appears to be $5/month). Also, please don't type your email subject in all capital letters. It comes across as shouting and is considered rude.

Got it, and sorry for typing it in CAPS; I will take care of it next time for sure. Also, could you help me out with the websites? I have no idea how to deploy a Python script online. I have done that on my local PC using an Apache server and CGI, and it works fine. What is this all called? As far as I have searched it is called a web framework, but I don't want to develop a website, just a server which can run my scripts at a specific time and send me an email if an error occurs. I use Python and I am not getting any leads. -- http://mail.python.org/mailman/listinfo/python-list
Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS
On Friday, 17 August 2012 18:16:08 UTC+5:30, coldfire wrote: I would like to know that where can a python script be stored on-line from were it keep running and can be called any time when required using internet. I have used mechanize module which creates a webbroswer instance to open a website and extract data and email me. I have tried Python anywhere but they dont support opening of anonymous websites. What's the current way to do this? Can someone point me in the right direction? My script has no interaction with the user; it just goes on-line, searches for something and emails me. Thanks

Sorry, I never wanted to be rude. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote: On 8/19/2012 4:04 AM, Paul Rubin wrote: I realize the folks who designed and implemented PEP 393 are very smart cookies and considered stuff carefully, while I'm just an internet user posting an immediate impression of something I hadn't seen before (I still use Python 2.6), but I still have to ask: if the 393 approach makes sense, why don't other languages do it? Python has often copied or borrowed, with adjustments. This time it is the first. We will see how it goes, but it has been tested for nearly a year already. Maybe it wasn't consciously borrowed, but whatever innovation is done, there's usually an obscure beardless language that did it earlier. :) Pike has a single string type, which can use the full Unicode range. If all codepoints are < 256, the string width is 8 (measured in bits); if < 65536, width is 16; otherwise 32. Using the inbuilt count_memory function (similar to the Python function used somewhere earlier in this thread, but which I can't at present put my finger to), I find that for strings of 16 bytes or more, there's a fixed 20-byte header plus the string content, stored in the correct number of bytes. (Pike strings, like Python ones, are immutable and do not need expansion room.) However, Python goes a bit further by making it VERY clear that this is a mere optimization, and that Unicode strings and bytes strings are completely different beasts. In Pike, it's possible to forget to encode something before (say) writing it to a socket. Everything works fine while you have only ASCII characters in the string, and then breaks when you have a > 255 codepoint - or perhaps worse, when you have a 127 < x < 256, and the other end misinterprets it. Really, the only viable alternative to PEP 393 is a fixed 32-bit representation - it's the only way that's guaranteed to provide equivalent semantics. The new storage format is guaranteed to take no more memory than that, and provide equivalent functionality. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
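One rough way to watch the analogous PEP 393 behaviour from Python 3.3 itself is sys.getsizeof on strings whose widest code point differs; the exact totals include per-object headers and vary by platform, so treat the numbers as illustrative only.

import sys

samples = [
    'a' * 100,             # widest code point < 256   -> 1 byte per character
    '\u0100' * 100,        # widest code point < 65536 -> 2 bytes per character
    '\U00010400' * 100,    # astral code point         -> 4 bytes per character
]

for text in samples:
    print('%d chars, widest U+%04X, %d bytes'
          % (len(text), max(map(ord, text)), sys.getsizeof(text)))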
Re: How do I display unicode value stored in a string variable using ord()
In article mailman.3531.1345416176.4697.python-l...@python.org, Chris Angelico ros...@gmail.com wrote: Really, the only viable alternative to PEP 393 is a fixed 32-bit representation - it's the only way that's guaranteed to provide equivalent semantics. The new storage format is guaranteed to take no more memory than that, and provide equivalent functionality. In the primordial days of computing, using 8 bits to store a character was a profligate waste of memory. What on earth did people need with TWO cases of the alphabet (not to mention all sorts of weird punctuation)? Eventually, memory became cheap enough that the convenience of using one character per byte (not to mention 8-bit bytes) outweighed the costs. And crazy things like sixbit and rad-50 got swept into the dustbin of history. So it may be with utf-8 someday. Clearly, the world has moved to a 32-bit character set. Not all parts of the world know that yet, or are willing to admit it, but that doesn't negate the fact that it's true. Equally clearly, the concept of one character per byte is a big win. The obvious conclusion is that eventually, when memory gets cheap enough, we'll all be doing utf-32 and all this transcoding nonsense will look as antiquated as rad-50 does today. -- http://mail.python.org/mailman/listinfo/python-list
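For a feel of the memory trade-off being argued about here, the encoded sizes can simply be compared (Python 3; the sample strings are arbitrary, and utf-32-le is used so the byte-order mark does not inflate the count):

samples = {
    'mostly ascii': 'payment of 100 \u20ac',
    'greek':        '\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1',
    'cjk':          '\u4f60\u597d\u4e16\u754c',
}

for name, text in sorted(samples.items()):
    print('%-13s %2d chars  utf-8: %2d bytes  utf-32: %3d bytes'
          % (name, len(text), len(text.encode('utf-8')),
             len(text.encode('utf-32-le'))))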
Re: Abuse of Big Oh notation
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes: Of course *if* k is constant, O(k) is constant too, but k is not constant. In context we are talking about string indexing and slicing. There is no value of k, say, k = 2, for which you can say People will sometimes ask for string[2] but never ask for string[3]. That is absurd.

The context was parsing, e.g. recognizing a token like "a" or "foo" in a human-written chunk of text. Occasionally it might be "sesquipedalian" or some even worse outlier, but one can reasonably put a fixed and relatively small upper bound on the expected value of k. That makes the amortized complexity O(1), I think. -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get initial absolute working dir reliably?
On Sunday, 19 August 2012 01:19:59 UTC+10, kj wrote: What's the most reliable way for module code to determine the absolute path of the working directory at the start of execution?

Here's some very simple code that relies on the singleton nature of modules that might be enough for your needs:

import os

_workingdir = None

def set():
    global _workingdir
    _workingdir = os.getcwd()

def get():
    return _workingdir

At the start of your application, import workingdir and do a workingdir.set(). Then when you need to retrieve it, import it again and use workingdir.get():

a.py:
import workingdir
workingdir.set()

b.py:
import workingdir
print workingdir.get()

test.py:
import a
import b

You could also remove the need to call the .set() by implicitly assigning on the first import:

if '_workingdir' not in locals():
    _workingdir = os.getcwd()

But I like the explicitness. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On Monday, August 20, 2012 1:03:34 AM UTC+8, Blind Anagram wrote: Steven D'Aprano wrote in message news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com... On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote: [...] If you can consistently replicate a 100% to 1000% slowdown in string handling, please report it as a performance bug: http://bugs.python.org/ Don't forget to report your operating system. For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) running Windows 7 x64. Running Python from a Windows command prompt, I got the following on Python 3.2.3 and 3.3 beta 2:

python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 50.3 usec per loop
python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 24.3 usec per loop

This is an average slowdown by a factor of close to 2.3 on 3.3 when compared with 3.2. I am not posting this to perpetuate this thread but simply to ask whether, as you suggest, I should report this as a possible problem with the beta?

Um, another set of functions for speeding up ASCII string operations might be needed. But it is better that Python 3.3 first supports unicode strings that are easy for people in different languages to use. Anyway, I think Cython and Pyrex can be used to tackle this problem. -- http://mail.python.org/mailman/listinfo/python-list
Re: How do I display unicode value stored in a string variable using ord()
On 8/19/2012 6:42 PM, Chris Angelico wrote: On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote: Python has often copied or borrowed, with adjustments. This time it is the first. I should have added 'that I know of' ;-) Maybe it wasn't consciously borrowed, but whatever innovation is done, there's usually an obscure beardless language that did it earlier. :) Pike has a single string type, which can use the full Unicode range. If all codepoints are < 256, the string width is 8 (measured in bits); if < 65536, width is 16; otherwise 32. Using the inbuilt count_memory function (similar to the Python function used somewhere earlier in this thread, but which I can't at present put my finger to), I find that for strings of 16 bytes or more, there's a fixed 20-byte header plus the string content, stored in the correct number of bytes. (Pike strings, like Python ones, are immutable and do not need expansion room.) It is even possible that someone involved was even vaguely aware that there was an antecedent. The PEP makes no claim that I can see, but lays out the problem and goes right to details of a Python implementation. However, Python goes a bit further by making it VERY clear that this is a mere optimization, and that Unicode strings and bytes strings are completely different beasts. In Pike, it's possible to forget to encode something before (say) writing it to a socket. Everything works fine while you have only ASCII characters in the string, and then breaks when you have a > 255 codepoint - or perhaps worse, when you have a 127 < x < 256, and the other end misinterprets it. Python writes strings to file objects, including open sockets, without creating a bytes object -- IF the file is opened in text mode, which always has an associated encoding, even if the default 'ascii'. From what you say, this is what Pike is missing. I am pretty sure that the obvious optimization has already been done. The internal bytes of all-ascii text can safely be sent to a file with ascii (or ascii-compatible) encoding without intermediate 'decoding'. I remember several patches of that sort. If a string is internally ucs2 and the file is declared ucs2 or utf-16 encoding, then again, pairs of bytes can go directly (possibly with a byte swap). -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
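A minimal sketch of the text-mode point (Python 3; the file name is arbitrary): the encoding is attached to the file object at open() time, so the program hands over str objects and never builds an intermediate bytes object by hand.

# The encoding travels with the file object, set once at open() time.
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write('price: 5\u20ac\n')       # a str goes in; the io layer encodes it

with open('out.txt', 'r', encoding='utf-8') as f:
    print(f.read())                    # and decodes it again on the way back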
Re: Why doesn't Python remember the initial directory?
On Sun, 19 Aug 2012 14:01:15 -0700, Giacomo Alzetta wrote: You can obtain the working directory with os.getcwd(). Maybe. On Unix, it's possible that the current directory no longer has a pathname. As with files, directories can be deleted (i.e. unlinked) even while they're still in use. Similarly, a directory can be renamed while it's in use, so the current directory's pathname may have changed even while the current directory itself hasn't. -- http://mail.python.org/mailman/listinfo/python-list
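A hedged illustration of the first caveat (Unix-like systems only; the exact exception varies by platform and Python version): once the current directory has been unlinked, os.getcwd() has no pathname left to report.

import os
import tempfile

d = tempfile.mkdtemp()
os.chdir(d)
os.rmdir(d)                 # the directory we are "in" no longer has a name

try:
    print(os.getcwd())
except OSError as exc:      # typically ENOENT / FileNotFoundError on Linux
    print('getcwd() failed:', exc)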
Re: Why doesn't Python remember the initial directory?
On Monday, August 20, 2012 4:42:16 AM UTC+8, kj wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Does anyone know why this is? Is there a PEP stating the rationale for it? Thanks! Immutable data can be frozen and saved in somewhere off the main memory. Perative and imperative programming are different. Please check Erlang. -- http://mail.python.org/mailman/listinfo/python-list
Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS
On Sun, Aug 19, 2012 at 6:27 PM, coldfire amangill.coldf...@gmail.com wrote: Also I have no idea how to deploy a python script online. I have done that on my local PC using Apache server and cgi but it Works fine. Whats this all called? as far as I have searched its Web Framework but I dont wont to develop a website Just a Server which can run my scripts at specific time and send me email if an error occurs. I use Python And i am not getting any lead.

If you want to host web pages, like you're doing on your local PC with Apache and cgi, then you need an account with a web server, and a way to deploy your scripts and other content. This is often known as a 'web hosting service'[1]. The exact capabilities and restrictions will vary from provider to provider. If you just want an always-on, internet-accessible place to store and run your Python scripts, you may be interested in a 'shell account'[2], or if you need more control over the environment, a 'virtual private server'[3]. That may give you a few terms to google, and see what kind of service you need.

[1] http://en.wikipedia.org/wiki/Web_host
[2] http://en.wikipedia.org/wiki/Shell_account
[3] http://en.wikipedia.org/wiki/Virtual_private_server

-- Jerry -- http://mail.python.org/mailman/listinfo/python-list
Re: Why doesn't Python remember the initial directory?
In roy-ca6d77.17031119082...@news.panix.com Roy Smith r...@panix.com writes: In article k0rj38$2gc$1...@reader1.panix.com, kj no.em...@please.post wrote: As far as I've been able to determine, Python does not remember (immutably, that is) the working directory at the program's start-up, or, if it does, it does not officially expose this information. Why would you expect that it would? What would it (or you) do with this information? More to the point, doing a chdir() is not something any library code would do (at least not that I'm aware of), so if the directory changed, it's because some application code did it. In which case, you could have just stored the working directory yourself.

This means that no library code can ever count on, for example, being able to reliably find the path to the file that contains the definition of __main__. That's a weakness, IMO.

One manifestation of this weakness is that os.chdir breaks inspect.getmodule, at least on Unix. If you have some Unix system handy, you can try the following. First change the argument to os.chdir below to some valid directory other than your working directory. Then, run the script, making sure that you refer to it using a relative path. When I do this on my system (OS X + Python 2.7.3), the script bombs at the last print statement, because the second call to inspect.getmodule (though not the first one) returns None.

import inspect
import os

frame = inspect.currentframe()
print inspect.getmodule(frame).__name__

os.chdir('/some/other/directory')  # where '/some/other/directory' is
                                   # different from the initial directory

print inspect.getmodule(frame).__name__

...

% python demo.py
__main__
Traceback (most recent call last):
  File "demo.py", line 11, in <module>
    print inspect.getmodule(frame).__name__
AttributeError: 'NoneType' object has no attribute '__name__'

I don't know of any way to fix inspect.getmodule that does not involve, directly or indirectly, keeping a stable record of the starting directory.

But, who am I kidding? What needs fixing, right? That's not a bug, that's a feature! Etc. By now I have learned to expect that 99.99% of Python programmers will find that there's nothing wrong with behavior like the one described above, that it is in fact exactly As It Should Be, because, you see, since Python is the epitome of perfection, it follows inexorably that any flaw or shortcoming one may *perceive* in Python is only an *illusion*: the flaw or shortcoming is really in the benighted programmer, for having stupid ideas about programming (i.e. any idea that may entail that Python is not *gasp* perfect). Pardon my cynicism, but the general vibe from the replies I've gotten to my post (i.e. if Python ain't got it, it means you don't need it) is entirely in line with these expectations. -- http://mail.python.org/mailman/listinfo/python-list
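Whatever one thinks of the design question, a workable pattern is simply to record the start-up directory once, in a module that is imported before anything can call os.chdir(); all names below are illustrative examples, not a standard-library facility.

# initialdir.py -- import as early as possible, e.g. at the top of the main script
import os

STARTUP_DIR = os.path.abspath(os.getcwd())   # recorded once, at first import

def resolve(relative_path):
    """Interpret a path relative to the directory the program was started
    from, even if os.chdir() has been called since."""
    return os.path.join(STARTUP_DIR, relative_path)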
Re: Why doesn't Python remember the initial directory?
On Sun, Aug 19, 2012 at 9:57 PM, kj no.em...@please.post wrote: By now I have learned to expect that 99.99% of Python programmers will find that there's nothing wrong with behavior like the one described above, that it is in fact exactly As It Should Be, because, you see, since Python is the epitome of perfection, it follows inexorably that any flaw or shortcoming one may *perceive* in Python is only an *illusion*: the flaw or shortcoming is really in the benighted programmer, for having stupid ideas about programming (i.e. any idea that may entail that Python is not *gasp* perfect). Pardon my cynicism, but the general vibe from the replies I've gotten to my post (i.e. if Python ain't got it, it means you don't need it) is entirely in line with these expectations. Since you have no respect for the people you're writing to, why bother? I know I certainly have no desire to spend any time at all on your problem when you say things like that. Perhaps you're looking for the argument clinic instead? http://www.youtube.com/watch?v=RDjCqjzbvJY -- Jerry -- http://mail.python.org/mailman/listinfo/python-list
Legal: Introduction to Programming App
Good evening, I am considering developing an iOS application that would teach average people how to program in Python. The app will be sold on the Apple app store. May I develop this app? To what extent do I need to receive permission from the Python Software Foundation? To what extent do I need to recognize the Python Software Foundation in my app? Thank you, Matthew Zipf -- http://mail.python.org/mailman/listinfo/python-list
Re: Why doesn't Python remember the initial directory?
On Monday, 20 August 2012 11:57:46 UTC+10, kj wrote: This means that no library code can ever count on, for example, being able to reliably find the path to the file that contains the definition of __main__. That's a weakness, IMO.

No, it's not. It's a _strength_. If you've written a library that requires absolute knowledge of its installed location in order for its internals to work, then I'm not installing your library.

When I do this on my system (OS X + Python 2.7.3), the script bombs at the last print statement, because the second call to inspect.getmodule (though not the first one) returns None.

So, uh, do something sane like test for the result of inspect.getmodule _before_ trying to do something invalid to it?

I don't know of any way to fix inspect.getmodule that does not involve, directly or indirectly, keeping a stable record of the starting directory.

Then _that is the answer_. YOU need to keep a stable record:

import inspect
import os

THIS_FILE = os.path.join(os.getcwd(), 'this_module_name.py')

frame = inspect.currentframe()
print inspect.getmodule(frame).__name__
os.chdir('/some/other/directory')
print inspect.getmodule(frame, _filename=THIS_FILE).__name__

But, who am I kidding? What needs fixing, right? That's not a bug, that's a feature! Etc.

Right. Because that sort of introspection of objects is rare, why burden the _entire_ language with an obligation that is only required in a few places?

By now I have learned to expect that 99.99% of Python programmers will find that [blah blah blah, whine whine whine]. Pardon my cynicism, but the general vibe from the replies I've gotten to my post (i.e. if Python ain't got it, it means you don't need it) is entirely in line with these expectations.

Oh my god, how DARE people with EXPERIENCE in a language challenge the PRECONCEPTIONS of an AMATEUR!!! HOW DARE THEY?!?! -- http://mail.python.org/mailman/listinfo/python-list
Re: Why doesn't Python remember the initial directory?
My apologies for any double-ups and bad formatting. The new Google Groups interface seems to have effectively shat away decades of UX for something that I can only guess was generated randomly. -- http://mail.python.org/mailman/listinfo/python-list