Re: [Python-ideas] Re: Magnitude and ProtoMagnitude ABCs — primarily for argument validation
[Marco Sulla ] > Excuse me, Tim Peters, what do you think about my (probably heretical) > proposal of simply raising an exception instead of return a NaN, like > Python already do for division by zero? Sorry, I'm missing context. I don't see any other message(s) from you in this thread, so don't know what you mean. The message of mine you're replying to here said nothing about exceptions or NaNs. -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> Sorry, I'm not arguing about this any more. Pickle doesn't work at >> all at the level of "count of bytes followed by a string". [Random832 ] > The SHORT_BINBYTES opcode consists of the byte b'C', followed by *yes > indeed* "count of bytes followed by a string". Yes, some individual opcodes do work that way. >> If you >> want to make a pickle argument that makes sense, I'm afraid you'll >> need to become familiar with how pickle works first. This is not the >> place for a pickle tutorial. >> >> Start by learning what a datetime pickle actually is. >> pickletools.dis() will be very helpful.
> 0: \x80 PROTO 3
> 2: cGLOBAL 'datetime datetime'
>21: qBINPUT 0
>23: CSHORT_BINBYTES b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
>35: qBINPUT 1
>37: \x85 TUPLE1
>38: qBINPUT 2
>40: RREDUCE
>41: qBINPUT 3
>43: .STOP
>
> The payload is ten bytes, and the byte immediately before it is in fact > 0x0a. If I pickle any byte string under 256 bytes long by itself, the > byte immediately before the data is the length. This is how I initially > came to the conclusion that "count of bytes followed by a string" was > valid. Ditto. > I did, before writing my earlier post, look into the high-level aspects > of how datetime pickle works - it uses __reduce__ to create up to two > arguments, one of which is a 10-byte string, and the other is the > tzinfo. Those arguments are passed into the date constructor and > detected by that constructor - for example, I can call it directly with > datetime(b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00') and get the same result > as unpickling. Good job! That abuse of the constructor was supposed to remain a secret ;-) > At the low level, the part that represents that first argument does > indeed appear to be "count of bytes followed by a string". I can add to > the count, add more bytes, and it will call the constructor with the > longer string. If I use pickletools.dis on my modified value the output > looks the same except for, as expected, the offsets and the value of the > argument to the SHORT_BINBYTES opcode. > > So, it appears that, as I was saying, "wasted space" would not have been > an obstacle to having the "payload" accepted by the constructor (and > produced by __reduce__, ultimately _getstate) consist of "a byte string > of >= 10 bytes, the first 10 of which are used and the rest of which are > ignored by python <= 3.5" instead of "a byte string of exactly 10 > bytes", since it would have accepted and produced exactly the same > pickle values, but been prepared to accept larger arguments pickled from > future versions. Yes, if we had done things differently from the start, things would work differently today. But what's the point? We have to live now with what _was_ done. A datetime pickle carrying a string payload with anything other than exactly 10 bytes will almost always blow up under older Pythons, and would be considered "a bug" if it didn't. Pickles are not at all intended to be forgiving (they're enough of a potential security hole without going out of their way to ignore random mysteries). It might be nicer if Python had a serialization format more deliberately designed for evolution of class structure - but it doesn't. Classes that need such a thing now typically store their own idea of a "version" number as part of their pickled state; datetime never did. > ... 
> So have I shown you that I know enough about the pickle format to know > that permitting a longer string (and ignoring the extra bytes) would > have had zero impact on the pickle representation of values that did not > contain a longer string? Yes. If we had a time machine, it might even have proved useful ;-) > I'd already figured out half of this before > writing my earlier post; I just assumed *you* knew enough that I > wouldn't have to show my work. It's always best to show your work on a public list. Thanks for finally ;-) doing so! -- https://mail.python.org/mailman/listinfo/python-list
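A short sketch of the kind of inspection discussed above; the payload bytes quoted in the disassembly decode to 2015-09-14 21:06:42, so pickling that datetime reproduces it, and the constructor call at the end is the same "abuse" mentioned in the thread:

    import pickle
    import pickletools
    from datetime import datetime

    # Disassembling a datetime pickle shows the SHORT_BINBYTES opcode
    # carrying the 10-byte payload being discussed.
    dt = datetime(2015, 9, 14, 21, 6, 42)
    pickletools.dis(pickle.dumps(dt, protocol=3))

    # Passing the raw payload back to the constructor rebuilds the value,
    # which is essentially what unpickling does under the hood.
    print(datetime(b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'))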
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Random832 ] > Would allowing a 16-byte string in the future have increased the storage > occupied by a 10-byte string today? Would allowing a third argument in > the future have increased the storage occupied by two arguments today? > As far as I can tell the pickle format for non-primitive types isn't > _that_ fixed-width. Sorry, I'm not arguing about this any more. Pickle doesn't work at all at the level of "count of bytes followed by a string". If you want to make a pickle argument that makes sense, I'm afraid you'll need to become familiar with how pickle works first. This is not the place for a pickle tutorial. Start by learning what a datetime pickle actually is. pickletools.dis() will be very helpful. >>> That you didn't makes me wonder just where you're finding the space to put >>> the >>> fold bit. >> PEP 495 gives all the details. Short course: there are bits that are >> _always_ 0 now within some datetime pickle bytes. `fold` will abuse >> one of those always-0-now pickle bits. > And what happens to older implementations if that bit is 1? Unpickling will raise an exception, complaining that the minute value is out of range. >> Aware datetimes _within_ a zone also follow the naive time model. >> It's unfortunate that they're nevertheless called "aware" datetimes. >> >> So, sorry, even when sorting a list of aware datetimes, if they share >> a common zone it is wholly intended that they all work in naive time. > And if some of them share a common zone, then some of them will work in > naive time, and some of them will work in aware time, and some pairs > (well, triples) of them will cause problems for sort/bisect algorithms. All sorts of things may happen, yes. As I said, if you need to care, convert to UTC first. Most apps do nothing like this. > Maybe it'd be best to simply ban interzone comparisons. We cannot. Backward compatibility. It would have been better had interzone comparisons and subtraction not been supported from the start. Too late to change that. > Or have some sort of context manager to determine how arithmetic and comparisons work. Write a PEP ;-) -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> Because all versions of Python expect a very specific pickle layout >> for _every_ kind of pickled object (including datetimes). Make any >> change to the pickle format of any object, and older Pythons will >> simply blow up (raise an exception) when trying to load the new pickle >> - or do something insane with the pickle bits. It's impossible for >> older Pythons to know anything about what "the new bits" are supposed >> to mean, and there is no way to spell, in the pickle engine, "but if >> you're an older version, skip over the next N bytes". [Random832 ] > Well, you could have put some reserved bits in the original pickle > format for datetime back when it was first defined, or even just allowed > passing in a longer string for future extension purposes. Yes, we "could have" done that for all pickle formats for all types. But why on Earth would we? Pickle size is important to many apps (e.g., Zope applications can store billions of pickles in databases, and it may not be purely coincidence ;-) that Zope Corp paid for datetime development), and there would have been loud screaming about any "wasted" bytes. > That you didn't makes me wonder just where you're finding the space to put the > fold bit. PEP 495 gives all the details. Short course: there are bits that are _always_ 0 now within some datetime pickle bytes. `fold` will abuse one of those always-0-now pickle bits. >> It's not so much a "good idea" as that it's the only idea consistent >> with Python's "naive time" model. Folds and gaps don't exist in naive >> time. Indeed, the _concept_ of "time zone" doesn't really exist in >> naive time. There's _inherent_ tension between the naive time model >> and the way multi-offset time zones actually behave. So it goes. > But why does it need to be consistent? You can't compare naive datetimes > with aware ones. If you want to sort/bisect a list of datetimes, they > have to either all be naive or all be aware. So when we're talking about > how ordering works, we're fundamentally talking about how it works for > aware datetimes. Aware datetimes _within_ a zone also follow the naive time model. It's unfortunate that they're nevertheless called "aware" datetimes. So, sorry, even when sorting a list of aware datetimes, if they share a common zone it is wholly intended that they all work in naive time. Apps that can't tolerate naive time should convert to UTC first. End of problems. -- https://mail.python.org/mailman/listinfo/python-list
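A minimal sketch of the "convert to UTC first" advice; the helper name here is made up:

    from datetime import timezone

    def sort_by_utc(aware_datetimes):
        # Sorting on the UTC equivalents gives a total order that doesn't
        # depend on how each datetime's own zone handles folds and gaps.
        return sorted(aware_datetimes, key=lambda dt: dt.astimezone(timezone.utc))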
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Random832 ] > A) I'm still not sure why, but I was talking about adding an int, not a > timedelta and a string. > > B) Older python versions can't make use of either utcoffset or fold, but > can ignore either of them. I don't even see why they couldn't ignore a > timedelta and a string if we felt like adding those. Because all versions of Python expect a very specific pickle layout for _every_ kind of pickled object (including datetimes). Make any change to the pickle format of any object, and older Pythons will simply blow up (raise an exception) when trying to load the new pickle - or do something insane with the pickle bits. It's impossible for older Pythons to know anything about what "the new bits" are supposed to mean, and there is no way to spell, in the pickle engine, "but if you're an older version, skip over the next N bytes". > C) What value fold "should" have can be inferred from the time, the > utcoffset, and the tzinfo. So you are proposing to add ... something ... _instead_ of adding `fold`. Already addressed that. See above. > I'm saying that *today*, even with no 495, it [utcoffset] does provide > *a* value in all cases (even if that's sometimes the "wrong" value > for an ambiguous time). Sure. > And that value is, for any plausible tzinfo, ordered the same for > any given pair of datetimes with the same tzinfo as the datetimes > considered as naive datetimes. Yes. > There is, or appears to be, a faction that is proposing to change that > by sorting fold=1 2:15 before fold=0 2:45 even though the former is > *actually* 30 minutes later than the latter, and I am *utterly baffled* > at why they think this is a good idea. It's not so much a "good idea" as that it's the only idea consistent with Python's "naive time" model. Folds and gaps don't exist in naive time. Indeed, the _concept_ of "time zone" doesn't really exist in naive time. There's _inherent_ tension between the naive time model and the way multi-offset time zones actually behave. So it goes. >> It is true that the earlier and later of an ambiguous time in a fold >> will compare equal in their own zone, but compare not equal after >> conversion to UTC (or to any other zone in which they're not in one of >> the latter zone's folds). Is that what you're talking about? > Yes. Or two different ambiguous times, where the properly earlier one > compares greater and vice versa. I have no idea why anyone thinks this > is reasonable or desirable behavior. From which I can guess, without asking, that you think "naive time" itself is unreasonable and undesirable ;-) -- https://mail.python.org/mailman/listinfo/python-list
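For readers coming to this archive later: the ordering being debated can be demonstrated with the stdlib that eventually shipped (`fold` landed in 3.6 and `zoneinfo` in 3.9, both after this thread), for example:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    tz = ZoneInfo("America/New_York")
    a = datetime(2021, 11, 7, 1, 15, fold=1, tzinfo=tz)  # the second 1:15, EST
    b = datetime(2021, 11, 7, 1, 45, fold=0, tzinfo=tz)  # the first 1:45, EDT

    # Within a single zone, comparison follows naive time, so a sorts first...
    print(a < b)                                                     # True
    # ...even though a is actually 30 minutes later than b in UTC.
    print(a.astimezone(timezone.utc) < b.astimezone(timezone.utc))   # False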
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> It depends on how expensive .utcoffset() >> is, which in turn depends on how the tzinfo author implements it. [Alex] > No, it does not. In most time zones, UTC offset in seconds can be computed > by C code as a 4-byte integer Which is a specific implementation of .utcoffset(). Which likely has nothing to do with how most tzinfo authors will implement _their_ .utcoffset(). For example, look at any tzinfo.utcoffset() implementation that currently exists ;-) > faster > than CPython can look up the .utcoffset method. (At least for times > within a few years around now.) A programmer who makes it slower should > be fired. So any programmer who implements .utcoffset() in Python should be fired? That's the only way I can read that. > Yet I agree, "'premature optimization' applies at this time." I'm more worried now about premature firing ;-) -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Random832 ] Whether or not datetimes stored tm_gmtoff and tm_zone workalikes has no effect on semantics I can see. If, in your view, they're purely an optimization, they're just a distraction for now. If you're proposing to add them _instead_ of adding `fold`, no, that can't work, for the pickle compatibility reasons already explained. Whether something is in a fold needs to be preserved across pickling, but "almost all" pickles need to be readable by older Pythons too. This is doable by adding one bit, but not doable at all if we need to add arbitrary timedelta and string objects _instead_ of that bit. ... >>> (No, I don't *care* how that's not how it's defined, >> ? How what is defined? > Just trying, unsuccessfully apparently, to head off the "no, it's > defined as working the same as a naive datetime if the tzinfo values are > the same" argument that got brought up the *last* time I made this > claim. Sorry, I still don't know what this is about. >>> it is *in fact* true for the UTC value that you will ever actually get >>> from converting the values to UTC *today*, and it's the only total >>> ordering that actually makes any sense) >> Well, you lost me there. In a post-495 world, conversion to UTC will >> work correctly in all cases. It cannot today. > It'll provide *a* value in all cases. It will provide the correct UTC offset in all cases. > The sort order today is equivalent to using that value in all > cases unless you've got a pathological tzinfo > specifically crafted to break it. I think that's an important enough > invariant to be worth keeping, since it is the only possible way to > provide a total order in the presence of interzone comparisons. Show some code? I don't know what you're talking about. It is true that the earlier and later of an ambiguous time in a fold will compare equal in their own zone, but compare not equal after conversion to UTC (or to any other zone in which they're not in one of the latter zone's folds). Is that what you're talking about? -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> It would be nice to have! .utcoffset() is an expensive operation >> as-is, and being able to rely on tm_gmtoff would make that dirt-cheap >> instead. [Alex] > If it is just a question of optimization, Yes. If it's more than just that, then 495 doesn't actually solve the problem of getting the correct UTC offset in all cases. > datetime objects can be extended to cache utcoffset. Note that PyPy > have recently added caching of the hash values in datetime objects. I > merged their changes in our datetime.py, but it did not look like C > implementation would benefit from it as much as pure python did. I > expect something similar from caching utcoffset: a measurable > improvement for tzinfos implemented in Python and a wash for those > implemented in C. (A more promising optimization approach is to define a C > API for tzinfo interface.) There's no answer to this. It depends on how expensive .utcoffset() is, which in turn depends on how the tzinfo author implements it. I don't care how fast it is. But, even if I did, "premature optimization" applies at this time ;-) -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> So, on your own machine, whenever daylight time starts or ends, you >> manually change your TZ environment variable to specify the newly >> appropriate eternally-fixed-offset zone? Of course not. [Random832 ] > No, but the hybrid zone isn't what gets attached to the individual > struct tm value when you convert a time from utc (or from a POSIX > timestamp) to a timezone local value. A single fixed utc offset is > (along with the name and, yes, isdst flag). You're assuming much more than POSIX - and the ISO C standard - requires. My description was quite explicitly about how POSIX has done it all along. tm_gmtoff and tm_zone are extensions to the standards, introduced (IIRC) by BSD. Portable code (including Python's implementation) can't assume they're available. > And pytz doesn't involve manually changing anything, it involves (as > best it can) automatically applying the value to attach to each > individual datetime value. .normalize() is a manual step. It doesn't invoke itself by magic (although I believe Stuart would like Python to add internal hooks so pytz _could_ get it invoked by magic). >> A datetime object is the Python spelling of a C struct tm, but never >> included the tm_isdst flag. > And no-one behind this proposal seems to be contemplating adding an > equivalent to tm_gmtoff, It was off the table because, for backward compatibility, we need to mess with the pickle format as little as possible. It's vital that datetimes obtained from old pickles continue to work fine, and that pickles obtained from new datetime objects work fine when loaded by older Pythons unless they actually require the new fold=1 possibility. > despite that it would serve the same disambiguation purpose and > make it much cheaper to maintain global invariants like a sort order > according to the UTC value It would be nice to have! .utcoffset() is an expensive operation as-is, and being able to rely on tm_gmtoff would make that dirt-cheap instead. > (No, I don't *care* how that's not how it's defined, ? How what is defined? > it is *in fact* true for the UTC value that you will ever actually get > from converting the values to UTC *today*, and it's the only total > ordering that actually makes any sense) Well, you lost me there. In a post-495 world, conversion to UTC will work correctly in all cases. It cannot today. -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Guido] >> Wouldn't it be sufficient for people in Creighton to set their timezone to >> US/Central? IIUC the Canadian DST rules are the same as the US ones. Now, >> the question may remain how do people know what to set their timezone to. >> But neither pytz nor datetime can help with that -- it is up to the >> sysadmin. [Laura Creighton] > The Canadian DST rules are not the same as the US ones, wherein lies > the rub. ? All US locations observing DST switch at 2AM on the local clock on the 2nd Sunday of March and the 1st Sunday of November (since 2007). Best I can tell, exactly the same is true of all zones observing DST in Canada and in Mexico. The province of Saskatchewan has a few areas which prefer to > be on Mountain time, which we will ignore in this mail. The bulk of > them prefer to be on Central time, which is the same as Winnipeg and > Chicago. But what happens on DST start day (early March) in > Saskachewan? Provincial law mandates that you do not change your > clock, and do not adopt daylight savings time. The people in > Saskachewan thus stick to CST all year round. And this is how > sites like http://www.timeanddate.com/worldclock/canada/saskatoon > record things, with them at CST. I'm sure Guido is only talking about places in Canada that _do_ observe DST. There are also places in the US that don't observe DST. Of course "the DST rules" only apply to locations that do observe DST, and that's what Guido is asking about. > However, the people of Creighton have decided to keep DST, in > violation of the law, so they set their clocks forward. > http://www.timeanddate.com/worldclock/canada/creighton does this > properly as well, showing them with CDT. > > But this is not quite the complete story. In many (most?) places in > Saskatchewan, the rule is understood differently. Instead of 'we keep > to CST all year long' is is understood that 'we keep central time in > the winter and mountain time in the summer'. > > This makes parsing log files from all over Saskachewan, where sometime > in March things often stop saying CST and say MDT instead rather more > interesting than the adverage person might suspect. Read question as "in general" ;-) There are plenty of exceptions to the general rules all over the world. -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> pytz solves it by _never_ creating a hybrid tzinfo. It only uses >> eternally-fixed-offset tzinfos. For example, for a conceptual zone >> with two possible total UTC offsets (one for "daylight", one for >> "standard"), there two distinct eternally-fixed-offset tzinfo objects >> in pytz. Then an ambiguous time is resolved by _which_ specific >> tzinfo object is attached. Typically the "daylight" tzinfo for the >> first time a repeated local time appears, and the "standard" tzinfo >> for its second appearance [Laura Creighton ] > Yes. I think this is a really great idea. I have no idea why other > people disagree. Really? >> In return, you have to use .localize() and .normalize() at various >> times, because pytz's tzinfo objects themselves are completely blind >> to the possibility of the total UTC offset changing. .localize() and >> .normalize() are needed to possibly _replace_ the tzinfo object in >> use, depending on the then-current date and time. > Yes. >>OTOH, `dateutil` does create hybrid tzinfo objects. No dances are >>ever needed to possibly replace them. But it's impossible for >>dateutil's tzinfos to disambiguate times in a fold. Incidentally, >>dateutil also makes no attempt to account for transitions other than >>DST (e.g., sometimes a zone may change its _base_ ("standard") offset >>from UTC). > I find this totally unacceptable. The precise referent for "this" isn't clear to me. The lack of knowledge of base-offset transitions is a property of dateutil's implementation, due to inheriting the default .fromutc() instead of implementing its own to take advantage of all the transition info a tzfile records. It's not _inherent_ to hybrid tzinfos. Just an omission in this specific zoneinfo wrapping. > My conclusion was that hybrid tzinfo > objects were a _really stupid idea_ proposed by somebody who misunderstood > the problem, or rather only understood the most common case. Smack them > with a dead fish, https://www.youtube.com/watch?v=i9SSOWORzw4 > and get back to work. So, on your own machine, whenever daylight time starts or ends, you manually change your TZ environment variable to specify the newly appropriate eternally-fixed-offset zone? Of course not. Your OS doesn't change it for you by magic either. Your OS implements a hybrid tzinfo: it knows all by itself what the appropriate current total UTC offset is, by consulting the tzfile (or hybrid POSIX TZ rule) you attached to your account the last time you set it, however many years ago that may have been. Indeed, I don't know of any software package _other_ than pytz doing it the "switch eternally-fixed-offset zones" way. Not that it's impossible Stuart is the first person in history not to implement it in a "really stupid" way ;-) But it was more that Stuart made heroic efforts to leap the datetime design gap explained next: In all POSIX-ish OSes, hybrid tzinfos are just business as usual: a struct tm's tm_isdst flag distinguishes ambiguous times. localtime() (UTC->local) sets tm_isdst by magic for you, and mktime() (local->UTC) consults tm_isdst to disambiguate local times in a fold. A datetime object is the Python spelling of a C struct tm, but never included the tm_isdst flag. PEP 495 aims to supply a similar bit. When it does, hybrid tzinfos will do cross-zone conversions correctly in all cases. Before then, pytz uses eternally-fixed-offset zones primarily to overcome the current lack of that bit. C's localtime() and mktime() are both spelled .astimezone(...) 
in Python, but under the covers localtime() is essentially Python's fromutc() (which will set the new bit appropriately after 495) and mktime() is essentially .utcoffset() (which will consult the new bit after 495). That's all there is to it. "In theory" ;-) datetime will become more-or-less exactly as stupid as POSIX has been for some decades. >> So, yup, if you're thoroughly indoctrinated in pytz behavior, you will >> be accurate but appear insane to Guido ;-) At a semantic level, a >> pytz tzinfo doesn't capture the notion of a zone with offset changes - >> it doesn't even try to. All knowledge about offset changes is inside >> the .localize() and .normalize() dances. > I can see why people would like to modify it to spit out this information > when asked. I don't understand why they would like to have a hybrid > tzinfo. The notion of replacing tzinfos when they become inappropriate > fills their souls with revulsion, or something? Because, e.g., a hybrid tzinfo (after 495) _can_ do it correctly all the time, with no need for user interventions ever. Same as your OS does. That's about usability and lack of surprise. It didn't take much web cruising, e.g., to find baffled new pytz users wondering why the zone displayed "was wrong" sometimes in utterly unexceptional cases (nowhere near a fold or a gap). "Read the docs! normalize()!" Why should they _have_ to? The only other things in their life that demand they manual
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Guido] > Wouldn't it be sufficient for people in Creighton to set their timezone to > US/Central? IIUC the Canadian DST rules are the same as the US ones. Now, > the question may remain how do people know what to set their timezone to. > But neither pytz nor datetime can help with that -- it is up to the > sysadmin. As Laura's use case evolved, it seems it was more that a train traveler from Halifax to Creighton wants to tell their Halifax relatives when they'll arrive in Creighton, but (of course) expressed in Halifax time. Nobody in this case knows anything about Creighton's rules, except the traveler may be staring at a train schedule giving arrival in Creighton time anyway. While this may be beyond pytz's wizardry, nothing is too hard for datetime ;-)

    datetime.timezone.setcontext("datetime-sig messages from mid-Sep 2015")
    arrivaltime = datetime.strptime(scraped_arrival_time, "")
    arrivaltime = datetime.replace(arrivaltime, tzinfo=gettz("Context/Creighton"))
    print(arrivaltime.astimezone(gettz("Context/Halifax")))

-- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> Whatever time zone the traveler's railroad schedule uses, so long as >> it sticks to just one [Laura] > This is what does not happen. Which is why I have written a python > app to perform conversions for my parents, in the past. So how did they get the right time zone rules for Creighton? >>But there's nothing new here: datetime has been around for a dozen >>years already, and nobody is proposing to add any new basic >>functionality to tzinfos. PEP 495 is only about adding a flag to >>allow correct conversion of ambiguous local times (typically at the >>end of DST, when the local clock repeats a span of times) to UTC. So >>if this were a popular use case, I expect we would already have heard >>of it. Note that Python zoneinfo wrappings are already available via, >>at least, the pytz and dateutil packages. > I am a happy user of pytz. On the other hand, I think this means that > my brain has gone through some sort of non-reversible transformation > which makes me accurate, but not exactly sane on the issue. pytz made some strange decisions, from the POV of datetime's intended tzinfo design. But it also solved a problem datetime left hanging: how to disambiguate ambiguous local times. The _intended_ way to model zones with UTC offset transitions was via what the docs call a "hybrid" tzinfo: a single object smart enough on its own to figure out, e.g., whether a datetime's date and time are in "daylight" or "standard" time. However, there's currently no way for such a tzinfo to know whether an ambiguous local time is intended to be the earlier or the later of repeated times. PEP 495 aims to plug that hole. pytz solves it by _never_ creating a hybrid tzinfo. It only uses eternally-fixed-offset tzinfos. For example, for a conceptual zone with two possible total UTC offsets (one for "daylight", one for "standard"), there are two distinct eternally-fixed-offset tzinfo objects in pytz. Then an ambiguous time is resolved by _which_ specific tzinfo object is attached. Typically the "daylight" tzinfo for the first time a repeated local time appears, and the "standard" tzinfo for its second appearance. In return, you have to use .localize() and .normalize() at various times, because pytz's tzinfo objects themselves are completely blind to the possibility of the total UTC offset changing. .localize() and .normalize() are needed to possibly _replace_ the tzinfo object in use, depending on the then-current date and time. OTOH, `dateutil` does create hybrid tzinfo objects. No dances are ever needed to possibly replace them. But it's impossible for dateutil's tzinfos to disambiguate times in a fold. Incidentally, dateutil also makes no attempt to account for transitions other than DST (e.g., sometimes a zone may change its _base_ ("standard") offset from UTC). So, yup, if you're thoroughly indoctrinated in pytz behavior, you will be accurate but appear insane to Guido ;-) At a semantic level, a pytz tzinfo doesn't capture the notion of a zone with offset changes - it doesn't even try to. All knowledge about offset changes is inside the .localize() and .normalize() dances. > I think I have misunderstood Alexander Belopolsky as saying that > datetime had functionality which I don't think it has. Thus I thought > we must be planning to add some functionality here. Sorry about this. 
Guido told Alex to stop saying that ;-) You can already get eternally-fixed-offset classes, like pytz does, on (at least) Linux systems by setting os.environ['TZ'] and then exploiting that .astimezone() without an argument magically synthesizes an eternally-fixed-offset tzinfo for "the system zone"'s (which the TZ envar specifies) current total UTC offset. That's not really comparable to what pytz does, except at a level that makes a lot of sense in theory but not much at all in practice ;-) > However, people do need to be aware, if they are not already, that > people with 3 times in 3 different tz will want to sort them. Telling > them that they must convert them to UTC before they do so is, in my > opinion, a very fine idea. Expecting them to work this out by themselves > via an assertion that the comparison operator is not transitive, is, > I think, asking a lot of them. Of course. Note that it's _not_ a problem in pytz, though: there are no sorting (or transitivity) problems if the only tzinfos you ever use have eternally fixed UTC offsets. There are no gaps or folds then, and everything works in an utterly obvious way - except that you have to keep _replacing_ tzinfos when they become inappropriate for the current dates and times in the datetimes they're attached to. -- https://mail.python.org/mailman/listinfo/python-list
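A small sketch of the localize/normalize dance described above (zone and times chosen arbitrarily; 2015-11-01 01:30 falls in that zone's fold):

    from datetime import datetime, timedelta
    import pytz

    eastern = pytz.timezone("US/Eastern")

    # localize() attaches one of the fixed-offset tzinfos; is_dst
    # disambiguates the repeated 1:30 at the end of DST.
    dt = eastern.localize(datetime(2015, 11, 1, 1, 30), is_dst=True)

    # After arithmetic the attached fixed offset may no longer be appropriate;
    # normalize() replaces it (adjusting the clock reading in the process).
    later = eastern.normalize(dt + timedelta(hours=1))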
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Laura] >>> But I am not sure how it is that a poor soul who just wants to print a >>> railway schedule 'in local time' is supposed to know that Creighton is >>> using Winnipeg time. [Tim] >> I'm not sure how that poor soul would get a railway schedule >> manipulable in Python to begin with ;-) [Laura] > Via Rail will give you a schedule when you book your tickets. But I > am wrong, it gives it to you in local time, which you can scrape or > even use the via rail api. So it is the person getting off in > Creighton who wants to tell his relatives back in Halifax what > time he is arriving (in their time) (so they can call him and > avoid the hellish hotel surtax on long distance calls) who will > have the problem. Whatever time zone the traveler's railroad schedule uses, so long as it sticks to just one, the traveler subtracts the departure time from the arrival time to determine how long the trip takes. They add that to the Halifax time at which they depart, and tell their Halifax relatives the result. They don't need to know anything about the destination's time zone to do this, unless a daylight transition occurs between departure and arrival, and the schedule itself remembered to account for it. In which case, pragmatically, they can just add an hour "to be safe" ;-) > And this is the sort of use case I think we will see a lot of. But there's nothing new here: datetime has been around for a dozen years already, and nobody is proposing to add any new basic functionality to tzinfos. PEP 495 is only about adding a flag to allow correct conversion of ambiguous local times (typically at the end of DST, when the local clock repeats a span of times) to UTC. So if this were a popular use case, I expect we would already have heard of it. Note that Python zoneinfo wrappings are already available via, at least, the pytz and dateutil packages. -- https://mail.python.org/mailman/listinfo/python-list
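The subtract-then-add recipe above, as a sketch with made-up times (both schedule entries just need to be in the same zone):

    from datetime import datetime

    depart_sched = datetime(2015, 9, 20, 8, 30)    # schedule's zone, whatever it is
    arrive_sched = datetime(2015, 9, 22, 17, 5)    # same zone
    trip_length = arrive_sched - depart_sched      # a timedelta

    depart_halifax = datetime(2015, 9, 20, 9, 30)  # the same departure, Halifax clock
    arrive_halifax = depart_halifax + trip_length  # arrival expressed in Halifax time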
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> Hi, Laura! By "zoneinfo" here, we mean the IANA (aka "Olson") time >> zone database, which is ubiquitous on (at least) Linux: >> >>https://www.iana.org/time-zones >> >>So "will a wrapping of zoneinfo handle XYZ?" isn't so much a question >>about the wrapping as about what's in the IANA database. [Laura] > Then we had better be able to override it when it is wrong. Anyone can write their own tzinfo implementing any rules they like, and nobody is required to use anyone else's tzinfos. That said, zoneinfo is the most extensive collection of time zone info there is, so most people will probably use only that. And that said, zoneinfo is inordinately concerned with recording highly dubious guesses about what "the rules" were even over a century ago. Most people would probably be happy with a tzinfo that _only_ knew what "today's rules" are. POSIX TZ rules give a simple way to spell exactly that. Simple, but annoyingly cryptic. Gustavo's `dateutil` already supplies a way to magically build a tzinfo implementing a zone specified by a POSIX TZ rule string. More obvious ways to spell that are surely possible (like, for example, the obvious ways). Patches welcome ;-) >> Best guess is that Creighton's rules are covered by that database's >> America/Winnipeg entries. >> >> # Saskatchewan >> # Other western towns (e.g. Lloydminster) are like Edmonton. >> # Matthews and Vincent (1998) write that Denare Beach and Creighton >> # are like Winnipeg, in violation of Saskatchewan law. > I think that this will work. > Creighton is just across the border from Flin Flan, Manitoba. Indeed I think > the problem of 'drunken people from Manitoba trying to get one hours more > drinking done and being a menace on the highway' may have fueled the > 'we are going to have DST in violation of the law' movement in Creighton. :-) > But I am not sure how it is that a poor soul who just wants to print a > railway schedule 'in local time' is supposed to know that Creighton is > using Winnipeg time. I'm not sure how that poor soul would get a railway schedule manipulable in Python to begin with ;-) If it's "a problem" for "enough" users of a computer system, a Linux admin could simply make "America/Creighton" a link to the "America/Winnipeg" tzfile. But doing that for every nameable place on Earth might be considered annoying. To cover everyone, you may even need to specify a street address within "a city": http://www.quora.com/Are-there-any-major-cities-divided-by-two-time-zones Blame politicians for this. I can assure you Guido is not responsible for creating this mess ;-) -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Alex] >>I will try to create a zoneinfo wrapping prototype as well, but I will >>probably "cheat" and build it on top of pytz. [Laura Creighton] > My question, is whether it will handle Creighton, Saskatchewan, Canada? > Creighton is an odd little place. Like all of Saskatchewan, it is > in the Central time zone, even though you would expect it to be > in the Mountain time zone based on its location on the globe. > The people of Saskatchewan have decided not to adopt Daylight > Savings time. Except for the people of Creighton (and > nearby Denare Beach) -- who _do_ observe Daylight savings time. > > makes for an interesting corner case, one that I remember for > personal (and not economic, or professional) reasons. Hi, Laura! By "zoneinfo" here, we mean the IANA (aka "Olson") time zone database, which is ubiquitous on (at least) Linux: https://www.iana.org/time-zones So "will a wrapping of zoneinfo handle XYZ?" isn't so much a question about the wrapping as about what's in the IANA database. Best guess is that Creighton's rules are covered by that database's America/Winnipeg entries. It's generally true that the database makes no attempt to name every location on the planet. Instead it uses names of the form "general/specific" where "general" limits the scope to some large area of the Earth (here "America" really means "North America"), and "specific" names a well-known city within that area. For example, I live in Ashland, Wisconsin (extreme far north in that state, on Lake Superior), but so far as IANA is concerned my time zone rules are called "America/Chicago" (some 460 air miles SSE, in a different state). Just for fun, I'll paste in the comments from the Saskatchewan section of IANA's "northamerica" data file (a plain text source file from which binary tzfiles like America/Chicago and America/Winnipeg are generated). You'll see Creighton mentioned if you stay alert ;-) # Saskatchewan # From Mark Brader (2003-07-26): # The first actual adoption of DST in Canada was at the municipal # level. As the [Toronto] Star put it (1912-06-07), "While people # elsewhere have long been talking of legislation to save daylight, # the city of Moose Jaw [Saskatchewan] has acted on its own hook." # DST in Moose Jaw began on Saturday, 1912-06-01 (no time mentioned: # presumably late evening, as below), and would run until "the end of # the summer". The discrepancy between municipal time and railroad # time was noted. # From Paul Eggert (2003-07-27): # Willett (1914-03) notes that DST "has been in operation ... in the # City of Moose Jaw, Saskatchewan, for one year." # From Paul Eggert (2006-03-22): # Shanks & Pottenger say that since 1970 this region has mostly been as Regina. # Some western towns (e.g. Swift Current) switched from MST/MDT to CST in 1972. # Other western towns (e.g. Lloydminster) are like Edmonton. # Matthews and Vincent (1998) write that Denare Beach and Creighton # are like Winnipeg, in violation of Saskatchewan law. # From W. Jones (1992-11-06): # The. . .below is based on information I got from our law library, the # provincial archives, and the provincial Community Services department. # A precise history would require digging through newspaper archives, and # since you didn't say what you wanted, I didn't bother. # # Saskatchewan is split by a time zone meridian (105W) and over the years # the boundary became pretty ragged as communities near it reevaluated # their affiliations in one direction or the other. 
In 1965 a provincial # referendum favoured legislating common time practices. # # On 15 April 1966 the Time Act (c. T-14, Revised Statutes of # Saskatchewan 1978) was proclaimed, and established that the eastern # part of Saskatchewan would use CST year round, that districts in # northwest Saskatchewan would by default follow CST but could opt to # follow Mountain Time rules (thus 1 hour difference in the winter and # zero in the summer), and that districts in southwest Saskatchewan would # by default follow MT but could opt to follow CST. # # It took a few years for the dust to settle (I know one story of a town # on one time zone having its school in another, such that a mom had to # serve her family lunch in two shifts), but presently it seems that only # a few towns on the border with Alberta (e.g. Lloydminster) follow MT # rules any more; all other districts appear to have used CST year round # since sometime in the 1960s. # From Chris Walton (2006-06-26): # The Saskatchewan time act which was last updated in 1996 is about 30 pages # long and rather painful to read. # http://www.qp.gov.sk.ca/documents/English/Statutes/Statutes/T14.pdf -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Alex] >>> I will try to create a zoneinfo wrapping prototype as well, but I will >>> probably "cheat" and build it on top of pytz. [Tim] >> It would be crazy not to ;-) Note that Stuart got to punt on "the >> hard part": .utcoffset(), since pytz only uses fixed-offset classes. >> For a prototype - and possibly forever after - I'd be inclined to >> create an exhaustive list of transition times in local time, parallel >> to the list of such times already there in UTC. [Alex] > Yes. The only complication is that you need four transition points instead > of two per year in a regular DST case: (1) start of gap; (2) end of gap; (3) > start of fold; and (4) end of fold. Once you know where you are with > respect to those points, figuring out utcoffset(), dst() and tzname() for > either value of fold is trivial. I wouldn't call those extras transitions - they're just warts hanging off of actual transitions. Earlier I showed Stuart how to determine everything about a possible fold from a UTC time using pytz's internal info, in PEP-431/495 Fri, 28 Aug 2015 01:01:06 -0500 He didn't reply that I saw, so it was either obvious or incomprehensible to him ;-) In any case, it just takes some very simple code once the transition record the UTC time belongs in is found. I'd be surprised if it weren't similarly easy to determine everything about a possible gap. At least in a zoneinfo wrapping, a hybrid tzinfo's .utcoffset() has to (at least internally) find "the transition record the UTC time belongs in" regardless. > ... > It's a shame though to work from a transitions in UTC list But that's what tzfiles store. It would be insane for a zoneinfo wrapping not to take advantage of that. For which reason, I do consider dateutil's zoneinfo wrapping to be insane ;-) (It inherits the default .fromutc()) Ah, BTW, I think dateutil's zoneinfo wrapping also misunderstood some of what's actually in a tzfile. Specifically, a tzfile's " UTC/local indicators" and " standard/wall indicators" are 100% useless for anything we need, and shouldn't even be read from the file(*) (seek over 'em). > because most of DST rules are expressed in local times and then > laboriously converted into UTC. It's just a few lines of code in zoneinfo's zic.c. Nobody is doing it "by hand" there. > I think I should also implement the POSIX TZ spec tzinfo. For that you really should grab dateutil. It has a full implementation of POSIX TZ rules, as hybrid tzinfos; here from its docs:

    >>> tz1 = tzstr('EST+05EDT,M4.1.0,M10.5.0')
    >>> tz2 = tzstr('AEST-10AEDT-11,M10.5.0,M3.5.0')
    >>> dt = datetime(2003, 5, 8, 2, 7, 36, tzinfo=tz1)
    >>> dt.strftime('%X %x %Z')
    '02:07:36 05/08/03 EDT'
    >>> dt.astimezone(tz2).strftime('%X %x %Z')
    '16:07:36 05/08/03 AEST'

Of course this implementation is tied into dateutil's rich supply of "calendar operations" too. > This is where the advantage of the "as intended" approach will be obvious. ? "As intended" is all about (to me) using hybrid tzinfos. And those are far richer in tzfiles than from POSIX rules. The latter only present us with simplest-possible DST transitions; tzfiles present us with every absurd political manipulation yet inflicted on humankind ;-) --- (*) Long boring story. Short course: those indicators are only needed, on some systems, if a POSIX TZ rule specifies a zone offset but gives no rules at all for daylight transitions, _and_ the system has a "posixrules" tzfile. 
Then an insane scheme is used to make up daylight rules "as if" the source file from which the posixrules tzfile was created had been for a zone with the TZ-specified standard offset instead, and these absurd indicators are used to figure out whether the posixrules source file specified _its_ daylight rules using UTC or local times, and if the later case then whether using standard time or wall-clock time instead. It's completely nuts. -- https://mail.python.org/mailman/listinfo/python-list
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Tim] >> Me too - except I think acceptance of 495 should be contingent upon >> someone first completing a fully functional (if not releasable) >> fold-aware zoneinfo wrapping. [Alex] > Good idea. How far are you from completing that? In my head, it was done last week ;-) In real life, I'm running out of spare time for much of anything anymore. I don't expect to be able to resume zoneinfo fiddling for at least 2 weeks. >> Details have a way of surprising, and >> we should learn from the last time we released a tzinfo spec in the >> absence of any industrial-strength wrappings using it. > I completely agree. That's why I am adding test cases like Lord Hope Island > and Vilnius to datetimetester. That helps a lot, but "industrial-strength" implies "by algorithm". There are far too many zones to deal with by crafting a hand-written class for each. > I will try to create a zoneinfo wrapping prototype as well, but I will > probably "cheat" and build it on top of pytz. It would be crazy not to ;-) Note that Stuart got to punt on "the hard part": .utcoffset(), since pytz only uses fixed-offset classes. For a prototype - and possibly forever after - I'd be inclined to create an exhaustive list of transition times in local time, parallel to the list of such times already there in UTC. An index into either list then gives an index into the other, and into the list of information about the transition (total offset, is_dst, etc). -- https://mail.python.org/mailman/listinfo/python-list
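A sketch of the parallel-lists idea (the two transitions and offsets below are made up for illustration; a real wrapping would read them out of a tzfile):

    from bisect import bisect_right
    from datetime import datetime, timedelta

    # Index-aligned lists: each transition instant expressed in UTC and in
    # local time, plus the (utcoffset, dst, tzname) in force from then on.
    transitions_utc   = [datetime(2015, 3, 8, 7), datetime(2015, 11, 1, 6)]
    transitions_local = [datetime(2015, 3, 8, 3), datetime(2015, 11, 1, 1)]
    info = [(timedelta(hours=-4), timedelta(hours=1), "EDT"),
            (timedelta(hours=-5), timedelta(0), "EST")]

    def info_for_utc(dt):
        # Find the last transition at or before dt (a naive datetime taken as UTC).
        i = bisect_right(transitions_utc, dt) - 1
        return info[max(i, 0)]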
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[Guido] >> Those pytz methods work for any (pytz) timezone -- astimezone() with a >> default argument only works for the local time zone. [Alex] > That's what os.environ['TZ'] = zonename is for. The astimezone() method > works for every timezone installed on your system. Try it - you won't even > need to call time.tzset()! I tried it. It makes no difference to anything for me. I stay on Windows to remind people that millions of Python users don't see any of the horrid nonsense Linuxish systems force on poor users ;-) > ... > In any case, there are three approaches to designing a TZ database interface > in the datetime module: the "as intended" approach, the pytz approach and > the astimezone(zonename:str) approach. Portability rules out #3, unless Python bundles its own zoneinfo wrapping. pytz's approach has many attractions, like no need for `fold` and no breakage of anything, and blazing fast .utcoffset(). Except at least arithmetic would have to be patched to do a `normalize` variant by magic (to attach the now-appropriate fixed-offset tzinfo, but without changing the clock in the process). Alas, that would be a huge speed hit for classic arithmetic. So, as always, the original intent is the only one that makes sense in the end ;-) > ... > That's why I believe PEP 495 followed by the implementation > of fold-aware "as intended" tzinfos (either within stdlib or by third > parties) is the right approach. Me too - except I think acceptance of 495 should be contingent upon someone first completing a fully functional (if not releasable) fold-aware zoneinfo wrapping. Details have a way of surprising, and we should learn from the last time we released a tzinfo spec in the absence of any industrial-strength wrappings using it. -- https://mail.python.org/mailman/listinfo/python-list
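The trick Alex refers to, sketched for a POSIX-ish system (as noted above, it won't do anything useful on Windows):

    import os
    import time
    from datetime import datetime, timezone

    os.environ["TZ"] = "America/Winnipeg"
    time.tzset()  # Unix-only; Alex suggests even this call may be unnecessary

    aware = datetime(2015, 9, 20, 18, 0, tzinfo=timezone.utc)
    # astimezone() with no argument synthesizes a fixed-offset tzinfo for
    # the TZ zone's total UTC offset at that moment.
    local = aware.astimezone()
    print(local, local.tzname())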
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
>>> If there are not, maybe the intended semantics should go >> > by the wayside and be replaced by what pytz does. >> Changing anything about default arithmetic behavior is not a >> possibility. This has been beaten to death multiple times on this >> mailing list already, and I'm not volunteering for another round of it >> ;-) [Alex] > Tim and Guido only grudgingly accept it, but datetime already gives you "the > pytz way" and PEP 495 makes a small improvement to it. To be clear, "Tim and Guido" have nothing at all against timeline arithmetic. Sometimes it's exactly what you need. But the _intended_ way to get it was always to convert to UTC first, or to just use plain old timestamps. Classic arithmetic was very intentionally the default. The only "grudgingly accepted" part is that .astimezone() grew a special case later, to make the absence of an argument "mean something": > The localize/normalize functionality is provided by the .astimezone() > method which when called without arguments will attach an appropriate > fixed offset timezone to a datetime object. You can then add timedeltas > to the result and stay within a "fictitious" fixed offset timezone that > extends > indefinitely in both directions. To get back to the actual civil time - you > call .astimezone() again. This gives you what we call here a "timeline" > arithmetic and occasionally it is preferable to doing arithmetic in UTC. > (Effectively you do arithmetic in local standard time instead of UTC.) > Using a fixed offset timezone other than UTC for timeline arithmetic is > preferable in timezones that are far enough from UTC that business hours > straddle UTC midnight. The distance from UTC can't make any difference to the end result, although if you're working in an interactive shell "it's nice" to see intermediate results near current wall-clock time. "A potential problem" with .astimezone()'s default is that it _does_ create a fixed-offset zone. It's not at all obvious that it should do so. First time I saw it, my initial _expectation_ was that it "obviously" created a hybrid tzinfo reflecting the system zone's actual daylight rules, as various "tzlocal" implementations outside of Python do. -- https://mail.python.org/mailman/listinfo/python-list
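Alex's timeline-arithmetic recipe, as a sketch (any aware datetime works as a starting point):

    from datetime import datetime, timedelta, timezone

    dt = datetime.now(timezone.utc)

    pinned = dt.astimezone()              # attach a fixed-offset "local" tzinfo
    later = pinned + timedelta(hours=30)  # arithmetic in that fictitious fixed zone
    civil = later.astimezone()            # back to the actual local offset for that moment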
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
[] > My context is that I am working on an idea to include utc offsets in > datetime objects (or on a similar object in a new module), as an > alternative to something like a "fold" attribute. and since "classic > arithmetic" is apparently so important, Love it or hate it, it's flatly impossible to change anything about it now, for backward compatibility. > I'm trying to figure out how > "classic arithmetic" _is actually supposed to work_ when adding a > timedelta to a time lands it on the opposite side of a transition (or in > the middle of a "spring forward" gap). datetime arithmetic is defined in the Python docs. > If there is a "fall back" transition tonight, then adding a day to a > time of 12 noon today could end up as: > > 12 noon tomorrow, offset still DST. > 12 noon tomorrow, offset in standard time, 25 hours from now in real > time. > 11 AM tomorrow, offset in standard time, 24 hours from now in real time > > Which one of these is "classic arithmetic"? 12 noon tomorrow in every case, regardless of tzinfo and regardless of whether any kind of transition may or may not have occurred. Whether it is or isn't in DST in this specific case isn't defined by Python - that's entirely up to what the tzinfo implementation says. The _intended_ way of implementing tzinfos would say it was in standard time. > Pytz (if you don't > explicitly call a "normalize" function) results in something that looks > like the first. Yes, because pytz always uses a fixed-offset tzinfo. There is no difference between timeline arithmetic and classic arithmetic in any fixed-offset zone. > In one of the models I've thought of, you can get the > second by replacing the tzinfo again, or the third by doing astimezone, > but the first preserves "exactly 24 hours in the future" in both the UTC > moment and the naive interpretation by leaving the offset alone even if > it is an "unnatural" offset. > > The second one above is what you get when you call normalize. Yes. .normalize() effectively converts to UTC and back again. In fact, this is all it does:

    def normalize(self, dt, is_dst=False):
        if dt.tzinfo is self:
            return dt
        if dt.tzinfo is None:
            raise ValueError('Naive time - no tzinfo set')
        return dt.astimezone(self)

.fromutc() is called as the last step of .astimezone(), and pytz overrides the default .fromutc() to plug "the appropriate" fixed-offset pytz tzinfo into the result. > My question was whether there are any real implementations that work the > intended way. dateutil, plus all implementations anyone may have written for themselves based on the Python doc examples. When datetime was originally released, there were no concrete tzinfo implementations in the world, so lots of people wrote their own for the zones they needed by copy/paste/edit of the doc examples. > If there are not, maybe the intended semantics should go > by the wayside and be replaced by what pytz does. Changing anything about default arithmetic behavior is not a possibility. This has been beaten to death multiple times on this mailing list already, and I'm not volunteering for another round of it ;-) -- https://mail.python.org/mailman/listinfo/python-list
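A sketch contrasting the answers above, using dateutil's hybrid tzinfo for a zone with a US-style fall-back on the night in question (America/New_York's 2015 transition was Nov 1):

    from datetime import datetime, timedelta
    from dateutil import tz

    eastern = tz.gettz("America/New_York")
    noon = datetime(2015, 10, 31, 12, 0, tzinfo=eastern)

    # Classic arithmetic: same wall-clock time tomorrow; the hybrid tzinfo
    # now reports standard time, so this is 25 real hours later.
    classic = noon + timedelta(days=1)

    # Timeline arithmetic via UTC: 24 real hours later, which reads 11 AM.
    timeline = (noon.astimezone(tz.tzutc()) + timedelta(days=1)).astimezone(eastern)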
Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?
> I was trying to find out how arithmetic on aware datetimes is "supposed > to" work, and tested with pytz. When I posted asking why it behaves this > way I was told that pytz doesn't behave correctly according to the way > the API was designed. You were told (by me) that its implementation of tzinfos was not the _intended_ way. Which is another way of saying it was an unanticipated way. "Correctly" is a whole different kind of judgment. pytz users who faithfully follow the docs seem happy with it. > The tzlocal module, on the other hand, appears to > simply defer to pytz on Unix systems. > > My question is, _are_ there any correct reference implementations that > demonstrate the proper behavior in the presence of a timezone that has > daylight saving time transitions? Which specific "proper behaviors"? "Hybrid" tzinfos following the recommendations in the Python docs, including the sample implementations in the docs, correctly mimic local clock behavior (skipping the clock ahead when DST starts, and moving the clock back when DST ends) when converting from UTC. It's impossible now to do local -> UTC conversions correctly in all cases, because it's impossible now to know which UTC time was intended for a local time in a fold. For the same reason, it's impossible now to know whether a local time in a fold is intended to be viewed as being in daylight time or standard time. But do note limitations of the default .fromutc() implementation: it only guarantees correct mimic-the-local-clock behavior when total-offset transitions are solely due to a notion of "daylight time" that strictly alternates between .dst() returning zero and non-zero values. Transitions due to any other reason may or may not be reflected in .fromutc()'s treatment of the local clock. Most importantly, a transition due to a zone changing its base ("standard") UTC offset is a possibility the default .fromutc() knows nothing about. The wrapping of the IANA ("Olson") zoneinfo database in dateutil uses hybrid tzinfos (the intended way of wrapping zones with multiple UTC offsets), and inherits the default .fromutc(), so all the above applies to it. Including all behaviors stemming from the impossibility of disambiguating local times in a fold. That's not a bug in dateutil. It's a gap in datetime's design. It was an intentional gap at the time, but that pytz went to such heroic lengths to fill it suggests PEP 495 may well be overdue ;-) -- https://mail.python.org/mailman/listinfo/python-list
Re: TopSort in Python?
[EMAIL PROTECTED] > Thanks for the topsort code. It would be useful in a project I'm > working on. Can I use the code for free under public domain? Thanks! If I ran the world, everything I posted to a public place would be public domain. Alas, the last lawyer who typed at me about this insisted that an individual in the USA cannot meaningfully disclaim copyright, so perhaps you're required to pay me millions of dollars instead ;-) To keep him happy and you solvent, I hereby license the code under the MIT license: http://www.opensource.org/licenses/mit-license.php Copyright (c) 1999-2008 Tim Peters Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -- http://mail.python.org/mailman/listinfo/python-list
Re: Signed zeros: is this a bug?
[attribution lost] ... >>> Yup: the workaround seems to be as simple as replacing all occurrences >>> of -0.0 with -(0.0). I'm embarrassed that I didn't figure this out >>> sooner.
>>>
>>> >>> x, y = -(0.0), 0.0
>>> >>> x, y
>>> (-0.0, 0.0)
[Alex Martelli] >> Glad it works for you, but it's the kind of workaround that could >> break with any minor tweak/optimization to the compiler... >> very fragile:-(. [also Alex] > OTOH, Python 2.4 works just fine...:
>
> Python 2.4.3 (#1, Apr 7 2006, 10:54:33)
> [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 0.0,-0.0
> (0.0, -0.0)
> >>> -0.0,0.0
> (-0.0, 0.0)
>
> so it seems to be very specifically a 2.5 problem.
It's a bug that keeps resurfacing, probably because there's no portable way to test that it stays fixed :-( (it's not an accident that the OP relied on atan2 to distinguish +0.0 from -0.0! they act the same in /almost/ all contexts). Python's grammar doesn't have negative numeric literals. Instead "-" CONSTANT_LITERAL looks like unary minus applied to the non-negative CONSTANT_LITERAL. All early versions of Python worked that way too. The first time the +/- 0.0 bug occurred is when the front end was changed to act as if "-" CONSTANT_LITERAL were a literal in its own right, and then that +0.0 == -0.0 caused the first instance of either in a compilation unit to be used as the value for all instances of both throughout the compilation unit. That was fixed by refusing to apply the optimization if the value of CONSTANT_LITERAL was a float that compared equal to 0.0. 2.5 introduced a new front end and more ambitious constant-folding, and I expect the bug showed up again due to one of those. Note: printing the value of a float 0 may not reveal its sign. IIRC, glibc's float-to-string routines do display the sign of 0, but Microsoft's do not. However, atan2() is sensitive to the sign under both glibm and Microsoft's libm equivalent. -- http://mail.python.org/mailman/listinfo/python-list
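Since printing may hide the sign, here is a small, self-contained way to see it -- nothing below is from the original thread, just standard library calls on any IEEE-754 platform:

    import math

    pz, nz = 0.0, -(0.0)           # the -(0.0) spelling sidesteps the folding bug
    print(pz == nz)                # True: the two zeros compare equal
    print(math.atan2(0.0, pz))     # 0.0
    print(math.atan2(0.0, nz))     # 3.141592653589793 -- the sign leaks through
    print(math.copysign(1.0, nz))  # -1.0 (copysign, added in 2.6, is the direct spelling)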
Re: doctest problem with null byte
[Stuart D. Gathman]
> I am trying to create a doctest test case for the following:
>
> def quote_value(s):
>     """Quote the value for a key-value pair in Received-SPF header
>     field if needed. No quoting needed for a dot-atom value.
>
>     >>> quote_value(r'abc\def')
>     '"abcdef"'
>     >>> quote_value('abc..def')
>     '"abc..def"'
>     >>> quote_value('')
>     '""'
>     >>> quote_value('-all\x00')
>     '"-allx00"'
>     ...
>
> However, doctest does *not* like the null byte in the example (yes,
> this happens with real life input):
> **
> File "/home/stuart/pyspf/spf.py", line 1453, in spf.quote_value
> Failed example:
>     quote_value('-all')
> Exception raised:
>     Traceback (most recent call last):
>       File "/var/tmp/python2.4-2.4.4c1-root/usr/lib/python2.4/doctest.py",
>         line 1248, in __run
>         compileflags, 1) in test.globs
>     TypeError: compile() expected string without null bytes
> **
>
> How can I construct a test cast for this?
As the docs say, doctest examples are parsed /twice/: once when Python compiles the module and creates string objects, and again when doctest extracts and passes example substrings to the compile() function (which you can see in your traceback). The first time, the \x00 in the quote_value('-all\x00') portion of your docstring got changed into an honest-to-goodness NUL byte. Passing that to compile() can't work. To alleviate that, and the "leaning toothpick syndrome" in your other examples, a simple approach is to make your docstring a raw string (stick the letter 'r' in front of the opening triple-quote). For example, this works fine:

    def f():
        r"""
        >>> len("\\")
        1
        >>> ord("\x00")
        0
        """

Because it's a raw string, Python does /not/ change the \x00 portion into a NUL byte when it compiles the module. Instead the docstring continues to contain the 4-character substring: \x00 and that's what compile() needs to see. To perform the same test without using a raw string requires slamming in more backslashes (or other ad-hoc escaping tricks):

    def g():
        """
        >>> len("\\\\")
        1
        >>> ord("\\x00")
        0
        """

-- http://mail.python.org/mailman/listinfo/python-list
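A quick confirmation of the two-pass point, outside doctest entirely (plain standard Python, not from the original post): in a regular string the escape collapses to a real NUL character, while in a raw string the four characters backslash, 'x', '0', '0' survive for doctest and compile() to see later.

    print(len('\x00'))    # 1 -- an actual NUL character
    print(len(r'\x00'))   # 4 -- backslash, 'x', '0', '0'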
Re: Maths error
[Nick Maclaren] >> ... >> Yes, but that wasn't their point. It was that in (say) iterative >> algorithms, the error builds up by a factor of the base at every >> step. If it wasn't for the fact that errors build up, almost all >> programs could ignore numerical analysis and still get reliable >> answers! >> >> Actually, my (limited) investigations indicated that such an error >> build-up was extremely rare - I could achieve it only in VERY >> artificial programs. But I did find that the errors built up faster >> for higher bases, so that a reasonable rule of thumb is that 28 >> digits with a decimal base was comparable to (say) 80 bits with a >> binary base. [Hendrik van Rooyen] > I would have thought that this sort of thing was a natural consequence > of rounding errors - if I round (or worse truncate) a binary, I can be > off by at most one, with an expectation of a half of a least > significant digit, while if I use hex digits, my expectation is around > eight, and for decimal around five... Which, in all cases, is a half ULP at worst (when rounding -- as everyone does now). > So it would seem natural that errors would propagate > faster on big base systems, AOTBE, but this may be > a naive view.. I don't know of any current support for this view. In the bad old days, such things were often confused by architectures that mixed non-binary bases with "creative" rounding rules (like truncation indeed), and it could be hard to know where to "pin the blame". What you will still see stated is variations on Kahan's telegraphic "binary is better than any other radix for error analysis (but not very much)", listed as one of two technical advantages for binary fp in: http://www.cs.berkeley.edu/~wkahan/MktgMath.pdf It's important to note that he says "error analysis", not "error propagation" -- regardless of base in use, rounding is good to <= 1/2 ULP. A fuller elementary explanation of this can be found in David Goldberg's widely available "What Every Computer Scientist Should Know About Floating-Point", in its "Relative Error and Ulps" section. The short course is that rigorous forward error analysis of fp algorithms is usually framed in terms of relative error: given a computed approximation x' to the mathematically exact result x, what's the largest possible absolute value of the mathematical r = (x'-x)/x (the relative error of x')? This framework gets used because it's more-or-less tractable, starting by assuming inputs are exact (or not, in which case you start by bounding the inputs' relative errors), then successively computing relative errors for each step of the algorithm. Goldberg's paper, and Knuth volume 2, contain many introductory examples of rigorous analysis using this approach. Analysis of relative error generally goes along independent of FP base. It's at the end, when you want to transform a statement about relative error into a statement about error as measured by ULPs (units in the last place), where the base comes in strongly. As Goldberg explains, the larger the fp base the sloppier the relative-error-converted-to-ULPs bound is -- but this is by a constant factor independent of the algorithm being analyzed, hence Kahan's "... better ... but not very much". In more words from Goldberg: Since epsilon [a measure of relative error] can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of B [the FP base, like 2 for binary or 10 for decimal], error estimates of formulas will be tighter on machines with a small B.
When only the order of magnitude of rounding error is of interest, ulps and epsilon may be used interchangeably, since they differ by at most a factor of B. So that factor of B is irrelevant to most apps most of the time. For a combination of an fp algorithm + set of inputs near the edge of giving gibberish results, of course it can be important. Someone using Python's decimal implementation has an often very effective workaround then, short of writing a more robust fp algorithm: just boost the precision. -- http://mail.python.org/mailman/listinfo/python-list
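To make the "correctly rounded means within half an ulp" statement concrete, a small check -- my own example, not from the thread, using math.ulp from Python 3.9+ and exact rational arithmetic from fractions:

    import math
    from fractions import Fraction

    x = 1.0 / 3.0                              # correctly rounded binary approximation to 1/3
    err = abs(Fraction(x) - Fraction(1, 3))    # exact representation error of x
    print(float(err))                          # about 1.85e-17
    print(err <= Fraction(math.ulp(x)) / 2)    # True: within half an ulp of the true value

The same half-ulp bound holds per operation in decimal too; the base only changes how sloppy the worst-case relative error corresponding to that half ulp can be (Kahan's "wobble").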
Re: Maths error
[Tim Peters] ... >> Huh. I don't read it that way. If it said "numbers can be ..." I >> might, but reading that way seems to requires effort to overlook the >> "decimal" in "decimal numbers can be ...". [Nick Maclaren] > I wouldn't expect YOU to read it that way, Of course I meant "putting myself in others' shoes, I don't ...". > but I can assure you from experience that many people do. Sure. Possibly even most. Short of writing a long & gentle tutorial, can that be improved? Alas, most people wouldn't read that either <0.5 wink>. > What it MEANS is "Numbers with a short representation in decimal "short" is a red herring here: Python's Decimal constructor ignores the precision setting, retaining all the digits you give. For example, if you pass a string with a million decimal digits, you'll end up with a very fat Decimal instance -- no info is lost. > can be represented exactly in decimal arithmetic", which is > tautologous. What they READ it to mean is "One advantage of > representing numbers in decimal is that they can be represented > exactly", and they then assume that also applies to pi, sqrt(2), > 1/3 > > The point is that the "decimal" could apply equally well to the > external or internal representation and, if you aren't fairly > clued-up in this area, it is easy to choose the wrong one. Worse, I expect most people have no real idea of that there's a possible difference between internal and external representations. This is often given as a selling point for decimal arithmetic: it's WYSIWYG in ways binary fp can't be (short of inventing power-of-2 fp representations for I/O, which few people would use). [attribution lost] >>>>> and how is decimal no better than binary? [Tim] >>>> Basically, they both lose info when rounding does occur. For >>>> example, [Nick] >>> Yes, but there are two ways in which binary is superior. Let's >>> skip the superior 'smoothness', as being too arcane an issue for >>> this group, >> With 28 decimal digits used by default, few apps would care about >> this anyway. > Were you in the computer arithmetic area during the "base wars" of the > 1960s and 1970s that culminated with binary winning out? Yes, although I came in on the tail end of that and never actually used a non-binary machine. > A lot of very well-respected numerical analysts said that larger bases > led to a faster build-up of error (independent of the precision). My > limited investigations indicated that there was SOME truth in that, > but it wasn't a major matter; I never say the matter settled > definitively. My point was that 28 decimal digits of precision is far greater than supplied even by 64-bit binary floats today (let alone the smaller sizes in most-common use back in the 60s and 70s). "Pollution" of low-order bits is far less of a real concern when there are some number of low- order bits you don't care about at all. >>> and deal with the other. In binary, calculating the mid-point >>> of two numbers (a very common operation) is guaranteed to be within >>> the range defined by those numbers, or to over/under-flow. >>> >>> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range >>> (x,y) in decimal, even for the most respectable values of x and y. >>> This was a MAJOR "gotcha" in the days before binary became standard, >>> and will clearly return with decimal. >> I view this as being an instance of "lose info when rounding does >> occur". For example, > No, absolutely NOT! Of course it is. 
If there were no rounding errors, the computed result would be exactly right -- that's darned near tautological too. You snipped the examples I gave showing exactly where and how rounding error created the problems in (x+y)/2 and x/2+y/2 for some specific values of x and y using decimal arithmetic. If you don't like those examples, supply your own, and if you get a similarly surprising result you'll find rounding error(s) occur(s) in yours too. It so happens that rounding errors in binary fp can't lead to the same counterintuitive /outcome/, essentially because x+x == y+y implies x == y in base 2 fp, which is indeed a bit of magic specific to base 2. The fact that there /do/ exist fp x and y such that x != y yet x+x == y+y in bases > 2 is entirely due to fp rounding error losing info. > This is an orthogonal matter, Disagree. > and is about the loss of an important invariant when using any base > above 2. It is that. > Back in the days when the
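Both phenomena are easy to reproduce with the decimal module; the numbers below are my own (not from the thread), with precision cranked down to 3 digits so the rounding is visible:

    from decimal import Decimal, getcontext

    getcontext().prec = 3
    x, y = Decimal("0.501"), Decimal("0.502")
    print(x + x == y + y)     # True: 1.002 and 1.004 both round to 1.00
    print((x + y) / 2)        # 0.500 -- the computed "midpoint" is below x

So x != y yet x+x == y+y, and the computed midpoint of [x, y] lands outside the range -- both consequences of the info lost when those sums and quotients were rounded.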
Re: Maths error
[Tim Peters] ... >|> Well, just about any technical statement can be misleading if not >|> qualified to such an extent that the only people who can still >|> understand it knew it to begin with <0.8 wink>. The most dubious >|> statement here to my eyes is the intro's "exactness carries over >|> into arithmetic". It takes a world of additional words to explain >|> exactly what it is about the example given (0.1 + 0.1 + 0.1 - 0.3 = >|> 0 exactly in decimal fp, but not in binary fp) that does, and does >|> not, generalize. Roughly, it does generalize to one important >|> real-life use-case: adding and subtracting any number of decimal >|> quantities delivers the exact decimal result, /provided/ that >|> precision is set high enough that no rounding occurs. [Nick Maclaren] > Precisely. There is one other such statement, too: "Decimal numbers > can be represented exactly." What it MEANS is that numbers with a > short representation in decimal can be represented exactly in decimal, > which is tautologous, but many people READ it to say that numbers that > they are interested in can be represented exactly in decimal. Such as > pi, sqrt(2), 1/3 and so on Huh. I don't read it that way. If it said "numbers can be ..." I might, but reading that way seems to require effort to overlook the "decimal" in "decimal numbers can be ...". [attribution lost] >|>> and how is decimal no better than binary? >|> Basically, they both lose info when rounding does occur. For >|> example, > Yes, but there are two ways in which binary is superior. Let's skip > the superior 'smoothness', as being too arcane an issue for this > group, With 28 decimal digits used by default, few apps would care about this anyway. > and deal with the other. In binary, calculating the mid-point > of two numbers (a very common operation) is guaranteed to be within > the range defined by those numbers, or to over/under-flow. > > Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range > (x,y) in decimal, even for the most respectable values of x and y. > This was a MAJOR "gotcha" in the days before binary became standard, > and will clearly return with decimal. I view this as being an instance of "lose info when rounding does occur". For example,

>>> import decimal as d
>>> s = d.Decimal("." + "9" * d.getcontext().prec)
>>> s
Decimal("0.9999999999999999999999999999")
>>> (s+s)/2
Decimal("1.000000000000000000000000000")
>>> s/2 + s/2
Decimal("1.000000000000000000000000000")

"The problems" there are due to rounding error:

>>> s/2   # "the problem" in s/2+s/2 is that s/2 rounds up to exactly 1/2
Decimal("0.5000000000000000000000000000")
>>> s+s   # "the problem" in (s+s)/2 is that s+s rounds up to exactly 2
Decimal("2.000000000000000000000000000")

It's always something ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Maths error
[Rory Campbell-Lange] >>> Is using the decimal module the best way around this? (I'm >>> expecting the first sum to match the second). It seem >>> anachronistic that decimal takes strings as input, though. [Nick Maclaren] >> As Dan Bishop says, probably not. The introduction to the decimal >> module makes exaggerated claims of accuracy, amounting to propaganda. >> It is numerically no better than binary, and has some advantages >> and some disadvantages. [Carsten Haese] > Please elaborate. Which exaggerated claims are made, Well, just about any technical statement can be misleading if not qualified to such an extent that the only people who can still understand it knew it to begin with <0.8 wink>. The most dubious statement here to my eyes is the intro's "exactness carries over into arithmetic". It takes a world of additional words to explain exactly what it is about the example given (0.1 + 0.1 + 0.1 - 0.3 = 0 exactly in decimal fp, but not in binary fp) that does, and does not, generalize. Roughly, it does generalize to one important real-life use-case: adding and subtracting any number of decimal quantities delivers the exact decimal result, /provided/ that precision is set high enough that no rounding occurs. > and how is decimal no better than binary? Basically, they both lose info when rounding does occur. For example,

>>> import decimal
>>> 1 / decimal.Decimal(3)
Decimal("0.3333333333333333333333333333")
>>> _ * 3
Decimal("0.9999999999999999999999999999")

That is, (1/3)*3 != 1 in decimal. The reason why is obvious "by eyeball", but only because you have a lifetime of experience working in base 10. A bit ironically, the rounding in binary just happens to be such that (1/3)*3 does equal 1:

>>> 1./3
0.33333333333333331
>>> _ * 3
1.0

It's not just * and /. The real thing at work in the 0.1 + 0.1 + 0.1 - 0.3 example is representation error, not sloppy +/-: 0.1 and 0.3 can't be /represented/ exactly as binary floats to begin with. Much the same can happen if you instead use inputs exactly representable in base 2 but not in base 10 (and while there are none such if precision is infinite, precision isn't infinite):

>>> x = decimal.Decimal(1) / 2**90
>>> print x
8.077935669463160887416100508E-28
>>> print x + x + x - 3*x # not exactly 0
1E-54

The same in binary f.p. is exact, because 1./2**90 is exactly representable in binary fp:

>>> x = 1. / 2**90
>>> print x # this displays an inexact decimal approx. to 1./2**90
8.07793566946e-028
>>> print x + x + x - 3*x # but the binary arithmetic is exact
0.0

If you boost decimal's precision high enough, then this specific example is also exact using decimal; but with the default precision of 28, 1./2**90 can't be represented exactly in decimal to begin with; e.g.,

>>> decimal.Decimal(1) / 2**90 * 2**90
Decimal("0.9999999999999999999999999999")

All forms of fp are subject to representation and rounding errors. The biggest practical difference here is that the `decimal` module is not subject to representation error for "natural" decimal quantities, provided precision is set high enough to retain all the input digits. That's worth something to many apps, and is the whole ball of wax for some apps -- but leaves a world of possible "surprises" nevertheless. -- http://mail.python.org/mailman/listinfo/python-list
Re: Q: About a programming problem with Python!
[followups set to comp.lang.python] [Danny] > I am just getting into OOP using Python and > because of the floating point of large integer (L) > square roots all the decimal expansions are truncated. > This has created a problem because I need these > decimal expansions in my algorithm other wise > it adds more sub iterations slowing down > greatly to the programs performance. > > Is there any way of retrieving these truncated decimal > expansions up to a certain (n) with Python? You want comp.lang.python for this, not sci.math. Be sure to give a concrete example, giving a specific input and showing as exactly as possible what you /want/ to achieve. There are so many possible ambiguities in what you wrote here that few people will take time to guess at what you might have intended to ask. > Java and I believe c++ do not have this problem > but I am not about to learn any of these because > I like Python. Programming language has little to nothing to do with this, it's the data type you're using. In Java, C++, and Python, if you're using the native double-precision floating-point type, you're limited to (on most machines) at most 53 bits of precision. If you need more than that, you need to use a different data type. For example, in recent versions of Python you could use the `decimal` module, and set its precision to virtually anything you like; e.g., under Python 2.5,

>>> import decimal
>>> t = decimal.Decimal(2)
>>> decimal.getcontext().prec # show the default precision
28
>>> print t.sqrt() # sqrt(2) rounded to 28 significant digits
1.414213562373095048801688724
>>> decimal.getcontext().prec = 56 # double the precision
>>> print t.sqrt() # sqrt(2) rounded to 56 significant digits
1.4142135623730950488016887242096980785696718753769480732

-- http://mail.python.org/mailman/listinfo/python-list
Re: a question on python dict
[Tim Peters] >> ... >> Taking my response out of context to begin with doesn't really change >> that I answered the question he asked ;-) [Fredrik Lundh] > welcome to comp.lang.python. > > Thanks for the welcome! It's tough to be a newbie here ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: a question on python dict
[Tim Peters] >>>> You should also note that copying a dict key or value (no matter of >>>> what type) consists in its entirety of copying one machine address (a >>>> 4- or 8-byte pointer, depending on platform). [Lawrence D'Oliveiro] >>> Actually, no. It also consists of updating reference counts as well. [Tim Peters] >> Not in context: dict resizing is refcount-neutral... [Lawrence D'Oliveiro] > The question wasn't about resizing, it was about "copying a dict key or > value". Then you misunderstood the OP. He was asking about dict resizing: ... When there are more and more items being added to the hashtable, it increase its buckets and copy the old items into the new one and re-calculate the hash value of each item. I think this will waste some time doing the increasing-copying thing. Taking my response out of context to begin with doesn't really change that I answered the question he asked ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: a question on python dict
[Tim Peters] >> You should also note that copying a dict key or value (no matter of >> what type) consists in its entirety of copying one machine address (a >> 4- or 8-byte pointer, depending on platform). [Lawrence D'Oliveiro] > Actually, no. It also consists of updating reference counts as well. Not in context: dict resizing is refcount-neutral, and the CPython implementation of dict resizing exploits that (quite directly in 2.5; indirectly before 2.5). It's not just that refcounts end up the same /across/ dict resizing, it's that they're not changed internally either for the duration of the resizing operation. -- http://mail.python.org/mailman/listinfo/python-list
Re: a question on python dict
[EMAIL PROTECTED] > Python dict is a hash table, isn't it? Yup. > I know that hashtable has the concept of "bucket size" and "min bucket > count" stuff, Some implementations of hash tables do. Python's does not. Python's uses what's called "open addressing" instead. > and they should be configurable so I can set them to the proper value > when I know how many items I'm going to handle. > > If these two values can't be set, the hashtable will give them default > values. When there are more and more items being added to the > hashtable, it increase its buckets and copy the old items into the new > one and re-calculate the hash value of each item. That's several distinct claims, each of which is true of some hash table implementations but not of others. Python's implementation has no "buckets", so all that's simply irrelevant. Python's implementation (not all do) saves the hash value of each item, so when it does need to grow (or shrink) the space allocated to the table, it does not need to recalculate the hash value of any item. You should also note that copying a dict key or value (no matter of what type) consists in its entirety of copying one machine address (a 4- or 8-byte pointer, depending on platform). > I think this will waste some time doing the increasing-copying thing. It does take time to resize. "Waste" is a value judgment ;-) > If I know I'm going to handle about 2 items, I can set the size of > hashtable to 3. > > So, can I do this in python? You can't. Unless you build up to 2 items over and over (many thousands of times), it's unlikely you'd be able to measure the time difference even if you could. -- http://mail.python.org/mailman/listinfo/python-list
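To make "open addressing" concrete, here is a deliberately tiny toy -- it is NOT CPython's implementation (CPython uses a perturbed probe sequence, caches each entry's hash, and resizes automatically); it only shows the idea that colliding keys probe for another slot in one flat table instead of chaining off per-bucket lists:

    class ToyDict:
        def __init__(self, nslots=8):
            self.slots = [None] * nslots          # one flat table; no buckets

        def _probe(self, key):
            i = hash(key) % len(self.slots)
            while self.slots[i] is not None and self.slots[i][0] != key:
                i = (i + 1) % len(self.slots)     # linear probing, for simplicity
            return i

        def __setitem__(self, key, value):
            self.slots[self._probe(key)] = (key, value)

        def __getitem__(self, key):
            entry = self.slots[self._probe(key)]
            if entry is None:
                raise KeyError(key)
            return entry[1]

    d = ToyDict()
    d["spam"] = 1
    d["eggs"] = 2
    print(d["spam"], d["eggs"])   # 1 2

Note the toy never resizes, so it would loop forever if you overfilled it -- deciding when and how much to grow is exactly the part Python's real implementation handles for you, and that you can't configure.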
Re: merits of Lisp vs Python
[Bill Atkins] >> (Why are people from c.l.p calling parentheses "brackets"?) [Kaz Kylheku] > Because that's what they are often called outside of the various > literate fields. For example, the English are "outside of the various literate fields"? FWIW, Python documentation consistently uses the jargon: () parentheses {} braces [] brackets That matches North American conventions, but occasionally confuses an international audience (for example, the English call parentheses "brackets" or "round brackets"). There's also a long tradition in both mathematics and computer science of using "bracket" as a generic term for any syntactic device used in pairs. For example, the "Revised Report on the Algorithmic Language Algol 60" way back in 1963 even called "begin" and "end" brackets. If it's tempting to call the authors of that illiterate too, keep in mind that John McCarthy was one of them -- although I'm sure Peter Naur would be willing to take the blame for dumbing it down for Europeans ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: not a big deal or anything, but, curiously:
[Simon Schuster] > following this tutorial, Which tutorial? > I copied and pasted: > > from string import * > > cds = """atgagtgaacgtctgagcattagctccgtatatcggcgcacaaa > tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg > ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc > ccagcgggcgaattgcatattcaccctgtttacccgcatgccgaattgacgagatca > tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt > attgaaacgccacccgcacgattctggcagctaaagagttaccttcgaccggcggtgatac > gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg > ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcccgaggag > gaacatcaacgctggcgcgaggcggtcgcgacccgccgttgctacatccggtggtgcgaac > gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg > tgagcgagaaagagagcgaagccttgttaagttgtttgcccatatcaccaaaccggagttt > caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta > tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttataaac > cgatcgggctaa""".replace("\n","") > > gc = float(count(cds, 'g') + count(cds, 'c'))/ len(cds) > > print gc > > -fin- > > which should yield: 0.54460093896713613.. > > but when I ran it I got: 0.544600938967 > > looking now I see it's truncating after a certain number of decimal > places. any ideas why? Read the Python Tutorial appendix on floating-point issues: http://docs.python.org/tut/node16.html As it says, str(a_float) rounds to 12 significant digits, and repr(a_float) to 17. The `print` statement implicitly applies str() to each item it prints. -- http://mail.python.org/mailman/listinfo/python-list
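The difference is easy to see directly (Python 2.x, the version family this thread is about; current Python 3 shows the shortest repr that round-trips, so both forms print the same thing there):

    gc = 0.54460093896713613
    print str(gc)    # 0.544600938967      (12 significant digits)
    print repr(gc)   # 0.54460093896713613 (17 significant digits)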
Re: merits of Lisp vs Python
[Paddy] >> http://en.wikipedia.org/wiki/Doctest [Kaz Kylheku] > I pity the hoplelessly anti-intellectual douche-bag who inflicted this > undergraduate misfeature upon the programming language. As a blind misshapen dwarf, I get far too much pity as it is, but I appreciate your willingness to share yours. If you like, feel free to give this precious gift to another. I get the impression that pity is something you don't feel often, and it would be a shame to waste it. > This must be some unofficial patch that still has a hope of being shot > down in flames, right? Alas, no. It was secretly snuck into Python's standard library in the 2.1 release (2001), and although we've done our best to hide it, it's widely used now. Strangely enough, it's especially popular among intellectuals ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Python spam?
[Aahz] >>> Anyone else getting "Python-related" spam? So far, I've seen messages >>> "from" Barry Warsaw and Skip Montanaro (although of course header >>> analysis proves they didn't send it). [Thomas Heller] >> I'm getting spam not only from Barry, but also from myself ;-) with >> forged headers. But I'm not sure what you mean with Python-related... >> Not the contents, IIRC. [Aahz] > Thing is, I don't usually get spam "from" people I know (or at least it > gets filtered before I see it), so someone is clearly using some resource > of Python-related email addresses. Just seems rather odd, and I wonder > whether it's some kind of DoS attack or what. It's been going on for years. The most frequent forged "Python related" sender address I've seen is actually [EMAIL PROTECTED], followed (but not closely) by /F's [EMAIL PROTECTED] Spammers harvest legit addresses to forge via scraping web pages and via Trojans scouring newbies' address books. The latter probably accounts for the very high rate of [EMAIL PROTECTED] forgeries. Via the former, anyone with a "public" email address can expect to see it get forged sooner or later. About two years ago I sent a polite email to a porn vendor asking them to please stop forging one of my email addresses as the sender of their spam. They didn't reply, but within a few days I stopped receiving porn spam claiming to come from me. Frankly, I miss it :-) -- http://mail.python.org/mailman/listinfo/python-list
Re: syntax error in sum(). Please explicate.
[Matt Moriarity] >> try surrounding your sum argument in brackets: >> >> sum([phi(x // ps[i+1], i) for i in range(a)]) >> >> instead of: >> >> sum(phi(x // ps[i+1], i) for i in range(a)) [Michael Press] > Thank you. That makes it work. But is a wrong solution ;-) As others have suggested, it's almost certainly the case that you were using a too-old version of Python. 2.5 is current, and 2.4.4 (the current bugfix release in the 2.4 line) would also work. You must have been using a version before 2.4, and if you're just starting with Python, it's a Bad Idea to artificially burden yourself with an old version where current examples can't work ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Decimal() instead of float?
[Michael B. Trausch] >> Let's say that I want to work with the latitude 33.6907570. In Python, >> that number can not be stored exactly without the aid of >> decimal.Decimal().
>>
>> >>> 33.6907570
>> 33.690756999999998
>> >>>
>>
>> As you can see, it loses accuracy after the 6th decimal place.
[Terry Reedy] > No, it loses accuracy in the 15th decimal place, which is the 17th digit. > This is far more accurate than any measured latitude could be. > If the above is a measurement, it represents anything from 33.69075695 > to 33.69075705. The conversion error would have to be multiplied by some > number of millions before it would be visible in the 7th decimal place > (your last). > > Any trigonometric calculations will use an *approximate* value of pi and > polynomial *approximations* of the theoretical functions. The trend in "modern" math libraries is actually that sin(), cos() and tan() return a result strictly less than 1 ULP off from the true mathematical result (including "acting as if" they used an infinitely precise value for pi). It remains unclear what actual benefit there is in, e.g., treating a sloppy argument like 2.31e273 "as if" it were infinitely precise and reducing it by an infinitely precise value of pi, /except/ that it allows making precise statements about the accuracy of functions doing so. In effect, this way all the blame gets pinned on the user if they don't like the results ;-) > Perhaps you have been given impossible conditions to fulfill. I'm sure it's inconceivable to any programmer that management would have an unreasonable expectation. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is there a commas-in-between idiom?
[Ernesto García García] > it's very common that I have a list and I want to print it with commas > in between. How do I do this in an easy manner, whithout having the > annoying comma in the end?
>
> list = [1,2,3,4,5,6]
>
> # the easy way
> for element in list:
>     print element, ',',
>
> print
>
> # this is what I really want. is there some way better?
> if (len(list) > 0):

More idiomatic as

    if len(list) > 0:

and even more so as plain

    if list:

>     print list[0],
>     for element in list[1:]:
>         print ',', element,

Do you really want a space before and after each inter-element comma?

>

An often-overlooked alternative to playing with ",".join() is:

    print str(list)[1:-1]

That is, ask Python to change the list into a big string, and just strip the brackets off each end:

>>> alist = ['a, bc', 4, True]
>>> print str(alist)[1:-1]
'a, bc', 4, True

Note the quotes around the string element! This differs from what your code snippet above would produce (in addition to differing wrt spaces around inter-element commas):

a, bc , 4 , True

-- http://mail.python.org/mailman/listinfo/python-list
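For comparison, the join() idiom alluded to above -- plain standard Python, converting each element to a string first so the separator goes only between elements:

    alist = ['a, bc', 4, True]
    print ", ".join(map(str, alist))   # a, bc, 4, True  (no quotes, unlike str(list)[1:-1])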
Re: ZODB: single database, multiple connections
[Petra Chong] > I am using Python 2.3 and ZODB (without the rest of Zope) with the > following pattern: > > * One process which writes stuff to a ZODB instance (call it test.db) > * Another process which reads stuff from the above ZODB instance > test.db > > What I find is that when the first process writes, the second doesn't > pick up the changes. I am sure this must be because I am using ZODB > wrongly, but I can't find any examples that show how to use ZODB in > this way, and I can't find any API docs for FileStorage, Connection, > etc. Reading the source code (from C:\python23\lib\site-packages) has > not thrown up anything useful. > > Here's my test code: > > A simple database class: > > ... > > Write: > > ... > > Read: > > db = Database("test.db", read_only = True) > > data = db.get_dictionary('data') > > If I have a Python shell open and run the above two lines, if I run the > write process repeatedly, the above "data" object never contains any of > the newly added items. To pick them up I have to totally recreate the > "db" object. You say that like it's hard to do ;-) It's a decent way to proceed. ZODB is a database, and has transactional semantics: you never see new object state on the read side because you're /in/ a transaction, and a transaction guarantees to give you a consistent view of the data. The view would be inconsistent if it showed you state committed by different transactions on the write side while you're still in the same transaction on the read side. > I must be doing something wrongly, but I can't figure out what. Seems to be a conceptual problem more than anything else. > Any suggestions? You already know that tossing your connection and opening a new connection will give you a newer view of the database, and it's unclear why you don't think that's good enough. Other approaches amount to telling ZODB (on the read side) that you're done with the current transaction. For example, try doing transaction.abort() on the read side when you're ready to see newer object state. BTW, a better place to ask about ZODB is the zodb-dev list: http://mail.zope.org/mailman/listinfo/zodb-dev It's not just for developers /of/ ZODB. Note that you need to subscribe to it in order to post to it (that's a heavyweight anti-spam gimmick common to all Zope lists). -- http://mail.python.org/mailman/listinfo/python-list
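A sketch of that last suggestion -- hedged: it uses the standalone `transaction` package spelling from later ZODB releases (older ones spelled it get_transaction().abort()), and some setups use connection.sync() instead, which aborts and processes invalidations in one step:

    import transaction
    from ZODB import FileStorage, DB

    storage = FileStorage.FileStorage("test.db", read_only=True)
    conn = DB(storage).open()
    root = conn.root()

    print(root.get("data"))   # the view as of the current transaction
    transaction.abort()       # give up that transaction...
    print(root.get("data"))   # ...so the next access can see newer committed state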
Re: Dealing with multiple sets
[John Henry] > If I have a bunch of sets: > > a = set((1, 2, 3)) > b = set((2, 3)) > c = set((1, 3)) > > > What's the cleanest way to say: > > 1) Give me a list of the items that are in all of the sets? (3 in the > above example) list(a & b & c) > 2) Give me a list of the items that are not in all of the sets? (1,2 in > the above example) list((a | b | c) - (a & b & c)) -- http://mail.python.org/mailman/listinfo/python-list
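Worked out with the sets from the question, just to show the results land where claimed:

    a = set((1, 2, 3))
    b = set((2, 3))
    c = set((1, 3))
    print(list(a & b & c))                    # [3]
    print(sorted((a | b | c) - (a & b & c)))  # [1, 2]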
Re: File read-write mode: problem appending after reading
[Frederic Rentsch] > Thanks a lot for your input. I seemed to notice that everything > works fine without setting the cursor as long as it stops before the end > of the file. Is that also a coincidence that may not work? "if you want to read following a write, or write following a read, on the same stream, you must perform a file-positioning operation (typically a seek) between them" -- http://mail.python.org/mailman/listinfo/python-list
Re: File read-write mode: problem appending after reading
[Frederic Rentsch] >Working with read and write operations on a file I stumbled on a > complication when writes fail following a read to the end. > > >>> f = file ('T:/z', 'r+b') > >>> f.write ('abcdefg') > >>> f.tell () > 30L > >>> f.seek (0) > >>> f.read () > 'abcdefg' > >>> f.flush () # Calling or not makes no difference > >>> f.write ('abcdefg') Nothing is defined about what happens at this point, and this is inherited from C. In standard C, if you want to read following a write, or write following a read, on the same stream, you must perform a file-positioning operation (typically a seek) between them. > Traceback (most recent call last): > File "", line 1, in -toplevel- > f.write ('abcdefg') > IOError: (0, 'Error') That's one possible result. Since nothing is defined, any other outcome is also a possible result ;-) > Flushing doesn't help. Right, and because flush() is not a file-positioning operation. > I found two work arounds: > > >>> f.read () > 'abcdefg' > >>> f.read () # Workaround 1: A second read (returning an empty string) > '' > >>> f.write ('abcdefg') > (No error) Purely an accident; may or may not "work" the next time you try it, or on another platform; etc. > >>> f.read () > 'abcdefg' > >>> f.seek (f.tell ()) # Workaround 2: Setting the cursor (to where it is!) That's a correct approach. f.seek(0, 1) (seek 0 bytes from the current position) is a little easier to spell. > >>> f.write ('abcdefg') > (No error) > > I found no problem with writing into the file. So it looks like it has > to do with the cursor which a read puts past the end, unless it is past > the end, in which case it goes back to the end. Is there a less kludgy > alternative to "fseek (ftell ())"? As above, you need to seek when switching from reading to writing, or from writing to reading. -- http://mail.python.org/mailman/listinfo/python-list
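Here is the whole pattern in one place -- a hypothetical scratch file, with the zero-byte relative seek mentioned above doing the positioning duty whenever the direction switches:

    f = open("scratch.bin", "w+b")   # any stream opened for both reading and writing
    f.write(b"abcdefg")
    f.seek(0)                        # write -> read: position first
    print(f.read())                  # b'abcdefg'
    f.seek(0, 1)                     # read -> write: seek 0 bytes from the current position
    f.write(b"abcdefg")
    f.close()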
Re: hundreds of seconds?
[EMAIL PROTECTED] > ... > G5-fiwihex:~ eur$ python > Python 2.3.5 (#1, Mar 20 2005, 20:38:20) > [GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> import time > >>> time.time() > 1160580871.258379 > >>> > > My G5 has lots of digits behind the decimal point, my measuring PC runs > W98. We'll see how it does there. But I trust it to be enough digits. On Windows 98, time.time() typically updates only once per 0.055 seconds (18.2 Hz), but time.clock() typically updates more than a million times per second. You do /not/ want to use time.time() for sub-second time measurement on Windows. Use time.clock() for this purpose on Windows. -- http://mail.python.org/mailman/listinfo/python-list
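A crude way to watch a clock's granularity for yourself -- standard library only; note that time.clock() is gone from modern Python, where time.perf_counter() is the high-resolution, cross-platform choice:

    import time

    def tick(clock):
        # wait for the clock to change, and report by how much it jumped
        a = clock()
        while True:
            b = clock()
            if b != a:
                return b - a

    print(min(tick(time.time) for _ in range(5)))
    print(min(tick(time.perf_counter) for _ in range(5)))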
Re: Python license question
[Martitza] > Mr. Peters: Na, my father's dead -- you can call me Uncle Timmy ;-) > Thank you for so kindly taking the time to resolve my misunderstandings > and to elaborate on the intent of the PSF. > > In particular, thank you for explaining in plain language how the > licenses stack. I'm sure our counsel will figure out what a license > from a defunct BeOpen means and anything we do will be in compliance > with all of the license stack. I don't know BeOpen.com's legal status (for example, I don't know whether a bankruptcy judgement was issued). CWI is a Dutch national research institute, and CNRI and the PSF are both US 501(c)(3) public charities -- BeOpen.com was the only for-profit entity in Python's licensing history. It's quite intentional that the top three licenses on the stack "look pretty much alike" -- if I had my way, there would be only one license, but the parties involved couldn't agree to that at the time. While at least the PSF will pursue licence violations, the license is so permissive that there hasn't yet been any need for that. To the best of my knowledge, BeOpen.com, CNRI, and CWI have never had license complaints against anyone's use of Python either. -- http://mail.python.org/mailman/listinfo/python-list
Re: Python license question
[Martitza] |> Hi. I work for a small company (actually in process of forming) > interested in embedding or extending python as part of our commercial > non-open-source product. We have legal counsel, but are interested in > the spirit as well as the letter of the law. Not much seems to have > been written about the python license since version 2, so pointers to > more recent discussions or contacts are appreciated. If this is not > the right place to ask these questions, I would welcome better ideas. > > We've read the license and "layman's language" at > http://www.python.org/psf/license/ and are need help reconciling the > two. There's also the informal license FAQ: http://wiki.python.org/moin/PythonSoftwareFoundationLicenseFaq I'm not a lawyer, so can't give legal advice. I can explain intent, speaking as a Director of the Python Software Foundation. > First Observation > The entire license -- such as produced by license() -- includes a lot > of information which appears to be historical or anecdotal, not the > kind of language our lawyers usually see in licensing agreements. > Lawyers claim they don't like "extra" language. You be the judge. :) Section A (HISTORY OF THE SOFTWARE) isn't really part of the license. But because the PSF doesn't have the right to /re/license the portions of Python copyright by CWI, CNRI, or BeOpen.com, the history given there can be helpful if anyone is inclined to spend (IMO) too much time worrying about the licenses. > First Question > Is all of that information really a required part of the license we > pass on to our customers or is the "PYTHON SOFTWARE FOUNDATION > LICENSE VERSION 2" portion sufficient? Section A is not required. All of section B is required. Parts of Python are copyrighted by, and licensed by, the four entities whose licenses appear in section B. > Second Observation > Referring to sections 2 and 3 of the PSF License Version 2... Our > non-open-source product will be a derived work, either by extending or > embedding the python interpreter. According to section 3, we will > briefly summarize these modifications to the python interpreter as > these relate to our product. Good. Indeed ;-) Note that there are no specific requirements about detail, and this is really for your benefit: the more clearly you identify the parts that are your original work, the easier it will be to prove it is in fact your work (should that become necessary). The NEWS file that comes with a Python distribution is the PSF's version of meeting this requirement (by the similar terms in the CNRI and BeOpen.com licenses, the PSF is also required to summarize the modifications the PSF makes!). > Section 2 says that we (as Licensee of the original python) have to > include without our modified python a copy of the license. The License > explicitly states in section 1 that it is between the PSF and the end > user. The phrase "end user" appears nowhere in the license. You must mean Licensee. As a Licensee, you're free (heck, encouraged) to make derivative works, but you have no legal authority to put a /different/ license on any part of the Python you started with. In exactly the same way, the PSF has no legal authority to put a different license on the versions of Python the PSF started with. When you distribute your derived work, your "end users" have rights to use the PSF-copyrighted portion of your work by exactly the same means you have: they also become Licensees of the PSF. And of CNRI, and of BeOpen.com, and of CWI. 
The license you put on your derived work as a whole only controls the portions of your derived work original with you, and your derived work as a whole. You can neither diminish nor increase the rights your end users have wrt any of the Python code you obtained from the PSF -- that's not your work, you don't have and can't get copyright on it, and you and your end users have exactly the same rights with respect to it (as spelled out in the license file you're asking about). > At the moment, we are the end user. But when we sell our > software as a derived work, our customer becomes the end user. So our > customers are entering into a direct agreement with PSF. This > indemnifies the PSF (sections 4,5,6,7) -- also good. > > But there is a side effect of section 2 that would seem to give our > customers many rights ("to reproduce, analyze, test, perform and/or > display publicly, prepare derivative works, distribute...") to the > derived work we created. I think the licenses are clear about this. For example, the PSF license at the top of the stack specifically says it only applies to "Python 2.4 alone or in any derivative version". The phrases "alone" and "in any derivative version" intend to modify "Python 2.4". It's not intended to be read as the two separate phrases "Python 2.4 alone" or "in any derivative version", and I'm not sure it even makes sense to try to read it that way. > We would have a problem with our cus
Re: Python to use a non open source bug tracker?
[Giovanni Bajo] > I understand your concerns, but I have to remember you that most bug reports > submitted by users go totally ignored for several years, or, better, forever. > I > do not have a correct statistic for this, Indeed you do not. > but I'm confident that at least 80% of the RFE or patches filed every week > is totally ignored, and probably at least 50% of the bugs too. None are /totally ignored/ -- indeed, at least I see every one as it comes in. You might want to change your claim to that no work obviously visible to you is done on them. That would be better. But, in fact, most bugs and patches are eventually closed, and many that stay open involve such obscure platform-dependent mysteries that nobody with sufficient platform expertise to resolve them appears to exist. For example, if you'd prefer, I'll assign all bugs and patches involving threads on HP-UX to you from now on ;-) These are the actual stats as of a few minutes ago: Bugs: 938 open of 7169 total ~= 87% closed Patches: 429 open of 3846 total ~= 89% closed Feature Requests: 240 open of 479 total ~= 50% closed > I think there is a much bigger problem here wrt QOS. Well, you're confident that 80% of patches are ignored. In reality, 89% of all patches ever submitted have been pursued to final resolution. Call me a stickler for detail, but something just doesn't jibe there to my eyes ;-) There's an easy way to improve these percentages dramatically, although they're not bad as-is: run thru them and close every one that isn't entirely clear. For example, reject every feature request, close every patch that changes visible behavior not /clearly/ fixing a bona fide bug, and close every "bug report" that's really a feature request or random "but Perl/Ruby/PHP doesn't do it this way" complaint in disguise. The Python developers tend to keep a report open if there's a scant non-zero chance that somebody, someday, might appear who's motivated enough to make something of it. If the goal was instead to make the percentages "look good", they could easily and justifiably be dramatically "improved" before today ends. For example, the oldest patch open today is a speculative implementation of rational numbers for Python. This is really a feature request in disguise, and has very little chance-- but not /no/ chance --of ever being accepted. The oldest bug open today is from 6 years ago, and looks like an easy-to-answer /question/ about the semantics of regular expressions in Python 1.6. I could take time to close that one now, but is that a /good/ use of time? Yes, but, at the moment, even finishing this reply seems to be a /better/ use of my time -- and after that, I'm going to get something to eat ;-) Note that I don't mean to claim that turnaround time on bugs and patches is ideal. To the contrary, if it's /my/ bug or patch I'm looking at it, turnaround time sucks, and if you're looking at yours, likewise for you. That's what happens when there are thousands of "you"s and a handful of "them"s, all of the latter volunteering "spare time". OTOH, turnaround time on Python bugs classified as critical is superb. -- http://mail.python.org/mailman/listinfo/python-list
Re: Question about turning off garbage collection
[David Hirschfield] > Question from a post to pygtk list...but it probably would be better > answered here: > > I encountered a nasty problem with an external module conflicting with > my python threads recently, and right now the only fix appears to be to > turn off garbage collection while the critical code of the thread is > running, and then turn it back on afterwards. > > Now, I don't know much about how the garbage collector works in python, > and in order to get the thread to run without freezing, I'm wrapping the > threaded processing function with calls to gc.disable()/gc.enable(). > > So what's that going to do? Stop the /cyclic/ garbage-collection algorithm from running for as long as it remains disabled. Most garbage collection in Python, in typical applications most of the time, is done by reference counting, and /that/ can never be disabled. Reference counting alone is not strong enough to detect trash objects in cycles (like A points to B and B points to A and nothing else points to either; they're unreachable trash then, but the reference count on each remains 1). The `gc` module controls cyclic garbage collection, which is a distinct subsystem used to find and reclaim the trash cycles reference counting can't find on its own. > Will calling gc.enable() put things in good shape? Will all objects created > while > the garbage collector was off now be un-collectable? No. It has no effect except to /suspend/ running the cyclic gc subsystem for the duration. Trash related to cycles will pile up for the duration. That's all. > I'm extremely wary of this solution, as I think anyone would be. I don't want > a > suddenly super-leaky app. As above, most garbage should continue to get collected regardless. > Comments? Suggestions? (I know, I know, avoid threads...if only I could) Nothing wrong with threads. My only suggestion is to dig deeper into /why/ something goes wrong when cyclic gc is enabled. That smells of a serious bug, so that disabling cyclic gc is just papering over a symptom of a problem that will come back to bite you later. For example, if some piece of code in an extension module isn't incrementing reference counts when it should, that could /fool/ cyclic gc into believing an object is trash /while/ the extension module believes it has a valid pointer to it. If so, that would be a serious bug in the extension module. Enabling cyclic gc should not create problems for any correctly written C code. -- http://mail.python.org/mailman/listinfo/python-list
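The split between refcounting and cyclic gc is easy to see. A small demonstration (standard library only; CPython assumed, since immediate refcount-driven reclamation is a CPython implementation detail):

    import gc, weakref

    class Node:
        pass

    gc.disable()

    plain = Node()
    w_plain = weakref.ref(plain)
    del plain
    print(w_plain() is None)      # True: refcounting freed it even with gc disabled

    a, b = Node(), Node()
    a.partner, b.partner = b, a   # a reference cycle
    w_cycle = weakref.ref(a)
    del a, b
    print(w_cycle() is None)      # False: the cycle piles up while cyclic gc is off

    gc.enable()
    gc.collect()
    print(w_cycle() is None)      # True: cyclic gc reclaimed it once re-enabled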
Re: weakSet
[EMAIL PROTECTED] > Has anyone ever think about a set wich references its elements weakly ? Yes, and there are excruciating subtleties. I only implemented as much of one as ZODB needed at the time:

    # A simple implementation of weak sets, supplying just enough of Python's
    # sets.Set interface for our needs.

    class WeakSet(object):
        """A set of objects that doesn't keep its elements alive.

        The objects in the set must be weakly referencable.  The objects need
        not be hashable, and need not support comparison.  Two objects are
        considered to be the same iff their id()s are equal.

        When the only references to an object are weak references (including
        those from WeakSets), the object can be garbage-collected, and will
        vanish from any WeakSets it may be a member of at that time.
        """

For the rest ;-), including discussion of excruciating subtleties that arise even implementing this little, see: http://svn.zope.org/ZODB/trunk/src/ZODB/utils.py -- http://mail.python.org/mailman/listinfo/python-list
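These days the standard library has weakref.WeakSet for the common case; the toy below just shows the core trick the quoted docstring describes -- key by id(), hold weakrefs, and let a weakref callback prune dead entries. It deliberately ignores the excruciating subtleties the post alludes to (such as callbacks firing while a collection is in progress):

    import weakref

    class TinyWeakSet:
        def __init__(self):
            self.refs = {}                    # id(obj) -> weakref to obj

        def add(self, obj):
            oid = id(obj)
            self.refs[oid] = weakref.ref(
                obj, lambda wr, oid=oid: self.refs.pop(oid, None))

        def __len__(self):
            return len(self.refs)

    class Thing:
        pass

    s = TinyWeakSet()
    t = Thing()
    s.add(t)
    print(len(s))   # 1
    del t
    print(len(s))   # 0 -- the callback removed the entry when t went away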
Re: Python to use a non open source bug tracker?
[Ben Finney] >> I don't see why you're being so obtuse [Terry Reedy] > I think name calling is out of line here. Name calling is always out of line on comp.lang.python. Unless it's done by Guido. Then it's OK. Anyone else, just remind them that even Hitler had better manners. That always calms things down again. loving-usenet-despite-that-it's-usenet-ly y'rs - tim -- http://mail.python.org/mailman/listinfo/python-list
Re: Is this a bug? Python intermittently stops dead for seconds
[Charlie Strauss] >>> level0: newly created objects >>> level1: objects that survived 1 round of garbage collection >>> level2: objects that survivied 2+ rounds of gargbage collection >>> >>> Since all of my numerous objects are level2 objects, and none of >>> them are every deallocated, then I should never trip the GC for >>> these. [Fredrik Lundh] >> your understanding of generational GC is a bit fuzzy, it seems. >> that an object is promoted to a higher level means that it's >> not inspected quite as often as lower-level objects, not that it's >> never inspected at all. [Charlie] > As I understand it, level2 (and level1) objects only undergo gc when > more than 10 of them is deallocated. Since I never deallocate, this > should not be tripped right? No. Cyclic gc is triggered by an excess of allocations over deallocations. > In any case regarding your other comments: >>> Could you clarify that for me. GC really has three components >>> two it: 1) finding and freeing unrefernced memory by refer >>> refer counts 2) cycle removal and 3) defragementing the >>> storage stack. If I turn off GC, don't I lose all of these? >> CPython always does (1), only does (2) if cycle-breaking GC isn't >> disabled, and never does (3). > Never does 3? Correct. > then how does it compact it's memory allocation area? It doesn't. > Surely it does not rely on the OS to manage every small object as a > separate memory allocation. It doesn't do that either. Python has its own small-object allocator, carving up 256KB chunks obtained from the system malloc(). It's based on size-segregated "pools" with extremely low bookkeeping overhead, and external fragmentation in small-object memory is essentially non-existent because of that (although it's possible to concoct pathological programs that exhibit it). > And just to be clear: are you saying that when I do a gc.disable this > only turns off 2 and not 1? Right. Refcounting (#1) can never be disabled, and cyclic GC (#2) is used only for trash objects that can't be reclaimed by #1 (meaning container objects in cycles). -- http://mail.python.org/mailman/listinfo/python-list
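The knobs in question live in the gc module (the numbers shown are the usual CPython defaults, which could differ by version):

    import gc

    print(gc.get_threshold())   # typically (700, 10, 10)
    print(gc.get_count())       # current allocation excess per generation

    # threshold0: run a collection once allocations exceed deallocations by this many;
    # generations 1 and 2 are only examined every threshold1-th / threshold2-th time.
    gc.set_threshold(700, 10, 10)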
Re: Is this a bug? Python intermittently stops dead for seconds
[charlie strauss] > Steve, digging into the gc docs a bit more, I think the behaviour I am seeing > is still > not expected. Namely, the program I offered has no obvious place where > objects > are deallocated. The way GC is supposed to work is thate there are three > levels of > objects > > level0: newly created objects > level1: objects that survived 1 round of garbage collection > level2: objects that survivied 2+ rounds of gargbage collection Yes. > Since all of my numerous objects are level2 objects, No. All newly created objects start in level0. Over time, most (but never all) of your objects end up in level 2. > and none of them are every deallocated, then I should never trip the GC for > these. Cyclic gc scans /all/ container objects at or under level N, whenever cyclic gc runs. N varies from run to run of cyclic gc, according to the scheme described in the docs for the gc.set_threshold() function. N=2 is certainly a possible value. There is never a time when a live container object becomes exempt from all future runs of cyclic gc. If a live container object survives two rounds of cyclic gc, it ends up in level2. That doesn't mean it's never looked at again, it just means it's not looked at during level0 or level1 (N=0 or N=1) runs of cyclic gc. It will still be looked at during all level2 (N=2) runs of cyclic gc. > Your explanation would require this to be tripped so I can't explain it. For > your > explanation to be correct then there as to be some non-obvious step in the > program > that is deallocating level2 items in sufficient numbers to trip the GC. Deallocations never trigger cyclic gc. As the docs say, cyclic gc is triggered by an /excess/ of allocations /over/ deallocations. So, e.g., if you delete container objects just as fast as you create them, cyclic gc will never run. But that's not what you're doing. Instead you're allocating new objects but never deallocating them. That makes cyclic gc run as frequently as it's possible for it to run. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is this a bug? Python intermittently stops dead for seconds
[charlie strauss] > I want to clarify that, on my computer, the first instance of the gap occurs > way > before the memory is filled. (at about 20% of physical ram). Additionally the > process monitor shows no page faults. Python has no idea of how much RAM you have, or even of how much RAM it's using. See the `gc` module docs, function set_threshold(), for a description of when Python decides to run a cyclic gc pass. That's all about the current excess (if any) of the number of container allocations over the number of container deallocations since the last time cyclic gc ran. > ... > (note that the array is filled as [1]*10, so there is actually only one > "integer", > but 10 array elements referring to it, per foo class.) Nevertheless cyclic gc has to examine all 10 array elements each time the list is scanned. > ... > For example when I write > > me.memory = [1]*nfoo > > perhaps internally, python is allocating an array of size foo and then > __copying__ it > into me.memory??? No. Assignments in Python never copy anything. > Since there is no reference to the intermediate it would then be marked for > future > garbage collection. Nothing in Python is "marked for future garbage collection". From time to time /all/ container objects are scanned to determine whether any have become cyclic trash. This takes time proportional to the total number of containees across all container objects. ... > foo was an election ballot holding 10 outcomes, and bar was a set of 100 > ballots > from 100 voting machines, and the total array was holding the ballot sets > from a few > thousand voting machines. > > Almost any inventory program is likely to have such a similar set of nested > arrays, > so it hardly seems unusual. For memory-consumption reasons alone, a serious such program is likely to use array.array objects at leaf level instead. array.array objects hold raw bits, not Python objects. Since they don't hold Python objects, they can't possibly be part of a cycle, so cyclic gc doesn't need to scan array.array contents either. -- http://mail.python.org/mailman/listinfo/python-list
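A sketch of the array.array suggestion (nfoo follows the quoted program; exact savings depend on the platform): an array of machine ints holds raw bits, so cyclic gc never has to look inside it.

    from array import array

    nfoo = 10
    # Instead of a list of nfoo references to Python ints ...
    as_list = [1] * nfoo
    # ... store nfoo raw C longs. This is not a container of Python objects,
    # so it can't be part of a cycle and cyclic gc never scans its contents.
    as_array = array('l', [1]) * nfoo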
Re: Is this a bug? Python intermittently stops dead for seconds
[Steve Holden, "pins the blame" for pauses on periodic cyclic gc] > ... > So basically what you have here is a pathological example of why it's > sometimes wise to disable garbage collection. Tim, did I miss anything? Nope! -- http://mail.python.org/mailman/listinfo/python-list
Re: Is this a bug? Python intermittently stops dead for seconds
[charlie strauss] >>> Below is a simple program that will cause python to intermittently >>> stop executing for a few seconds. it's 100% reproducible on my >>> machine. [Giovanni Bajo] >> Confirmed with Python 2.4.2 on Windows. [Jorgen Grahn] > And Python 2.3.5 on Linux, amd64. In fact, it causes heavy swapping so it > will disrupt other processes, too. > > I didn't read the code, stupid as I am, but I trust that the behavior > doesn't match what the code actually tries to do. No, that's what it does: as it goes on, it creates an ever-increasing and unbounded number of immortal objects and lists. The "time burps" merely reflect that the more live containers and objects you have, the longer it takes for cyclic gc to determine that they're not trash. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is this a bug? Python intermittently stops dead for seconds
[charlie strauss] > Below is a simple program that will cause python to intermittently > stop executing for a few seconds. it's 100% reproducible on my machine. Any program that creates a great many long-lived container objects will behave similarly during the creation phase. Others have explained why. Here's a simpler example:

    from time import time

    xs = []
    i = 0
    while 1:
        i += 1
        s = time()
        xs.append([[1, 2] for dummy in xrange(1000)])
        f = time()
        if f-s > 0.25:
            print "time", f-s, "on try", i
        if i % 100 == 0:
            print "xs made:", i

-- http://mail.python.org/mailman/listinfo/python-list
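A minimal variation of that example (a sketch, not from the original thread) makes the connection to cyclic gc concrete: with collection disabled the long pauses disappear, at the cost of never reclaiming cyclic trash while the loop runs.

    import gc
    from time import time

    gc.disable()          # suppress the periodic full scans of all containers
    xs = []
    i = 0
    while 1:
        i += 1
        s = time()
        xs.append([[1, 2] for dummy in xrange(1000)])
        f = time()
        if f-s > 0.25:
            print "time", f-s, "on try", i   # rarely, if ever, triggers now
        if i % 100 == 0:
            print "xs made:", i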
Re: Installing 2.5 with 2.4?
[Duncan Booth] >> Windows is only smart enough to avoid duplicate entries if you tell it >> to do that. e.g. >> >> PATH c:\python25;c:\python25\scripts;%PATH:c:\python25;c:\python25\scripts;=% >> >> will add the two Python 2.5 folders to the head of the path without >> duplicating them. [John Machin] > Wow .. I didn't know that! What's the syntax? Something like > %variablename[:oldtext=[newtext]]% > ? Yup. > Where is this documented? From a DOS box (cmd.exe), enter set /? -- http://mail.python.org/mailman/listinfo/python-list
Re: Hardlinks on NTFS
[Wildemar Wildenburger] >> I'm thinking of letting my program create hardlinks (or symlinks). I >> know python allows doing this for ext, reiser and the like, but >> apparently not for ntfs systems. >> Is there any package out there that lets me create links in a platform >> independent way? [Calvin Spealman] > Why isn't NTFS handled the same as any other hardlink supporting > filesystem? Is it a choice or a bug? None of the above. It's not the filesystem, it's what the platform C supports. POSIX specifies a small pile of standard routines for manipulating hard and symbolic links, and Python exposes those /when/ they're available. Microsoft's C supplies none of them. So it's a question of someone on Windows wanting this enough to volunteer to write code, tests, and docs, and volunteer to maintain, a pile of Microsoft-specific code to provide workalike functionality on Windows. -- http://mail.python.org/mailman/listinfo/python-list
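A hedged sketch of the usual interim workaround: use the POSIX call where the platform exposes it, and fall back to something weaker (here, a plain copy) where it doesn't. make_link is an invented name, not a standard function.

    import os, shutil

    def make_link(src, dst):
        if hasattr(os, "link"):        # present where the platform C library has link()
            os.link(src, dst)
        else:
            shutil.copyfile(src, dst)  # weaker fallback for platforms without it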
Re: How to get the "longest possible" match with Python's RE module?
[Tim Peters] >> [...] The most valuable general technique [Friedl] (eventually ;-) >> explained he called "unrolling", and consists of writing a regexp in >> the form: >> >>normal* (?: special normal* )* >> >> where the sets of characters with which `normal` and `special` can >> start are disjoint. Under most "NFA" implementation, such a regexp is >> immune to most bad behaviors, whether finding very long matches, or in >> searching a very long string that happens not to contain any matches. [Bryan Olson] > So how general is that? Between 6 and 7 ;-) > The books just says "certain common expressions". I think the real > answer is that one can unroll exactly the true regular expressions. I'm sure he didn't have that much in mind; e.g., it's a real strain to fit a fat alternation of fixed strings into that "pattern" (cat|dog|wombat|weasel|...).' > The unrolling is essentially resolving the states of a DFA by hand. Oh yes, for those who've studied the theoretical underpinnings. Most people have not. Much of what he gives as practical advice amounts to ways to "trick" an "NFA" engine into doing what a "DFA" engine would do, but explained from the POV of how a backtracking search engine works. And it makes /sense/ at that level too, divorced from any knowledge of how a linear-time automaton might be constructed in theory; e.g., you "want to" give the search engine only one choice because you want to avoid the possibility of needing to backtrack later, not because you're picturing a theoretical linear-time automaton and striving to mimic its behavior. That explanation may not work for you, but it's effective for people who haven't studied the theory here. Such explanations also extend naturally to gimmicks like backreferences, which have no counterpart in CompSci regexps. -- http://mail.python.org/mailman/listinfo/python-list
Re: min() & max() vs sorted()
[MRAB] >>> Some time after reading about Python 2.5 and how the built-in functions >>> 'min' and 'max' will be getting a new 'key' argument, I wondered how >>> they would treat those cases where the keys were the same, for example: >>> >>> L = ["four", "five"] >>> print min(L, key = len), max(L, key = len) >>> >>> The result is: >>> >>> ('four', 'four') [Tim Peters] >> min() and max() both work left-to-right, and return the minimal or >> maximal element at the smallest index. [MRAB] > It doesn't say that in the documentation. Right, the /language/ doesn't define anything about which specific minimal or maximal element is returned. Since you specifically mentioned Python 2.5, I figured you were asking about CPython -- which has always behaved in the way I explained. >>> I would've thought that min(...) should return the same as >>> sorted(...)[0] (which it does) >> It does, but only because Python's sort is stable, so that minimal >> elements retain their original relative order. That implies that the >> minimal element with smallest original index will end up at index 0 >> after sorting. >>> and that max(...) should return the same as sorted(...)[-1] (which it doesn't). >> Right -- although I don't know why you'd expect that. > Strings have index(), find(), etc which work left-to-right and > rindex(), rfind(), etc which work right-to-left. > > Lists have index() but not rindex(). > > I just thought that if min() and max() work left-to-right then for > completeness there should also be rmin() and rmax(); alternatively, > min() should return sorted()[0] and max() should return sorted()[-1] > for symmetry (my personal preference). If you were to make either of those a feature request, I don't expect they'd gain traction -- I expect "who cares?" would be the common challenge, and "I do" wouldn't silence it ;-) Compelling use cases sometimes work to get a new feature, but "completeness" or "symmetry" almost never do on their own (they function more as sanity checks on proposed solutions to use cases). -- http://mail.python.org/mailman/listinfo/python-list
Re: how do you convert and array of doubles into floats?
[Marc 'BlackJack' Rintsch] >> What about: >> >> b = array.array('f', a) [Diez B. Roggisch] > AFAIK d and f are synonym for arrays, as python doesn't distinguish > between these two on a type-level. And double it is in the end. While Python has no type of its own corresponding to the native C `float`, the `array` and `struct` modules do understand the native C `float`. A Python float (same as a C `double`) gets converted to a C `float` when stored into one of those, and a C `float` is converted to a Python float (C `double`) when a value is extracted.

>>> from array import array
>>> x = 1.01
>>> x
1.01
>>> array('d', [x])[0]   # full precision is preserved
1.01
>>> array('d', [x])[0] == x
True
>>> array('f', [x])[0]   # trailing bits are lost
1.0
>>> array('f', [x])[0] == x
False

-- http://mail.python.org/mailman/listinfo/python-list
Re: min() & max() vs sorted()
[MRAB] > Some time after reading about Python 2.5 and how the built-in functions > 'min' and 'max' will be getting a new 'key' argument, I wondered how > they would treat those cases where the keys were the same, for example: > > L = ["four", "five"] > print min(L, key = len), max(L, key = len) > > The result is: > > ('four', 'four') min() and max() both work left-to-right, and return the minimal or maximal element at the smallest index. > I would've thought that min(...) should return the same as > sorted(...)[0] (which it does) It does, but only because Python's sort is stable, so that minimal elements retain their original relative order. That implies that the minimal element with smallest original index will end up at index 0 after sorting. > and that max(...) should return the same as sorted(...)[-1] (which it > doesn't). Right -- although I don't know why you'd expect that. > I think that's just down to a subtlety in the way that 'max' is written. It's straightforward, skipping error cases:

    def max(*args):
        is_first_arg = True
        for arg in args:
            if is_first_arg:
                winner = arg
                is_first_arg = False
            elif arg > winner:   # only difference in min() is "<" here instead
                winner = arg
        return winner

-- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Bryan Olson] >>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful >>> book [Tim Peters] >> Strongly disagree: [...] [Bryan] > I know I'm disagreeing with a lot of smart people in panning > the book. That's allowed :-) >>> What Python uses is search-and-backtrack. Unfortunately such >>> engines don't have much theory behind them, and it's hard to >>> reason generally about what they do. >> Yup X 3, and the last is precisely why Friedl's book is valuable for >> people using "NFA" implementations: Friedl does a good job of >> explaining when and why you're likely to trigger atrocious runtime >> performance, and gives practical general techniques for avoiding those >> problems. None of that has anything to do with CompSci regexps >> either, but is vital knowledge for people hoping to make happy >> non-trivial use of Python/Perl/etc regexps. > I read the first edition, minus some engine-specific stuff. > General techniques are what I want, and I didn't see them in the > book. He had things to do and not do, but it didn't add up to > building (ir-)regular expressions with reliably tractable > behavior. My guess then is that the book moved at too slow a pace for you, and you got bored. The most valuable general technique he (eventually ;-) explained he called "unrolling", and consists of writing a regexp in the form: normal* (?: special normal* )* where the sets of characters with which `normal` and `special` can start are disjoint. Under most "NFA" implementation, such a regexp is immune to most bad behaviors, whether finding very long matches, or in searching a very long string that happens not to contain any matches. Without knowing this, under many implementations it's very easy to end up writing a regexp that matches quickly when a short match exists, but "blows up" (stack fault, or some checked internal limit hit) if only a very long match exists; and/or takes even exponential time to fail to match when a match doesn't exist. For example, a direct application is writing a robust regular expression to match C string literals (also one form of Python string literal: double-quote character at each end, and backslash escapes, including backslash-newline to span lines): " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* " The "normal case" is that it's just a regular old character: not a double quote, backslash, or newline. The "special case" is a backslash followed by any character whatsoever. The "normal" and "special" cases obviously don't intersect, so the interior of this regexp (i.e., ignoring the double quotes at each end) fits the "unrolling" pattern to a tee. As a result, you can use it with reasonably high confidence. For /why/ that's so, read the rest of his book ;-) In effect, it gives the search engine only one possible course of action at each character in the target string. So long as that's "normal", it has to stay in the "normal" part of the regexp, and as soon as it's "special" it has to leave the "normal" part. The intent is to eliminate any possibility for backtracking, thus ensuring predictable runtime and ensuring that no form of internal backtracking stack needs to be used (let alone hit its limit). The unrolling idea was new to me, and as soon as I learned it I replaced lots of regexps in the Python library with rewritten versions that demonstrably performed very much better -- even allowed closing some bug reports concerning extreme cases (extremely long matches, and unbearable runtime on long strings without any matches). 
A different lesson to take from this is that nobody is likely to stumble into effective "NFA" techniques based on luck or "common sense". It's a set of obscure & specialized learned skills. -- http://mail.python.org/mailman/listinfo/python-list
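For the curious, the unrolled C-string pattern above drops straight into the re module; the sample strings below are invented for illustration, and re.VERBOSE is used only so the spacing and comments survive.

    import re

    c_string = re.compile(r'''
        "                       # opening quote
        [^"\\\n]*               # normal*: anything but quote, backslash, newline
        (?:
            \\ [\x00-\xff]      # special: a backslash plus any byte
            [^"\\\n]*           # normal* again
        )*
        "                       # closing quote
        ''', re.VERBOSE)

    print c_string.match(r'"a \"quoted\" word"').group()    # matches the whole literal
    print c_string.match('"no closing quote') is None       # True: fails fast, no blowup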
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] >> Basically, the problem is this:
>>
>> >>> p = re.compile("do|dolittle")
>> >>> p.match("dolittle").group()
>> 'do'
...
>> The Python regular expression engine doesn't exhaust all the >> possibilities, but in my application I hope to get the longest possible >> match, starting from a given point. >> >> Is there a way to do this in Python? [Bryan Olson] > Yes. Here's a way, but it sucks real bad:
>
>     def longest_match(re_string, text):
>         regexp = re.compile('(?:' + re_string + ')$')
>         while text:
>             m = regexp.match(text)
>             if m:
>                 return m
>             text = text[:-1]
>         return None

If you want to try something like that, note that the match() method accepts optional slice arguments, so the "text = text[:-1]" business can be replaced with much quicker little-integer arithmetic. -- http://mail.python.org/mailman/listinfo/python-list
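A sketch of that suggestion: the same brute-force idea as the quoted function, but shrinking the effective end of the string with the documented pos/endpos arguments instead of building a new string on every iteration.

    import re

    def longest_match(re_string, text):
        regexp = re.compile('(?:' + re_string + ')$')
        # '$' anchors at endpos, so no string copies are needed.
        for end in xrange(len(text), 0, -1):
            m = regexp.match(text, 0, end)
            if m:
                return m
        return None

    print longest_match("do|dolittle", "dolittle").group()   # 'dolittle'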
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] >> Oh, please do have a look at the second link I've posted. There's a >> table comparing the regexp engines. The engines you've tested probably >> all use an NFA implementation. [Bryan Olson] > Unfortunately, the stuff about NFA's is wrong. Friedl's awful > book Strongly disagree: it's an excellent book about the /pragmatics/ of using "regular expressions" as most widely implemented. It's not at all about "regular expressions" in the CompSci sense of the term, which appears to be your complaint. > was the first time I saw this confusion about what NFA is; > I don't know if he originated the mess or just propagated it. As far as I could tell at the time, he originated it. I'm not happy about that either. > "Nondeterministic finite automata" is well defined in theory > of computation. The set of languages accepted by NFA's is > exactly the same as the set accepted by DFA's. And which has very little to do with "regular expressions" as most widely implemented -- gimmicks like backreferences are wholly outside the DFA formalism. > What Python uses is search-and-backtrack. Unfortunately such > engines don't have much theory behind them, and it's hard to > reason generally about what they do. Yup X 3, and the last is precisely why Friedl's book is valuable for people using "NFA" implementations: Friedl does a good job of explaining when and why you're likely to trigger atrocious runtime performance, and gives practical general techniques for avoiding those problems. None of that has anything to do with CompSci regexps either, but is vital knowledge for people hoping to make happy non-trivial use of Python/Perl/etc regexps. -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] > ... > In fact, what I'm doing is handling a lot of regular expressions. I > wanted to build VERY LONG regexps part by part and put them all into a > file for easy modification and maintenance. The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the syntactically wrong (* and ) is something to be > replaced, and then I wrote a little script to do the string > substitution, to get very long regexps to be compiled. I thought that > might be a way to handle long and complex regexps, and in this course I > encountered the problem with the semantics of '|'. > > My approach may sound desperate and funny, but please, do you have any > good idea as to how to handle long and complex regexps? My best advice is to find a different approach entirely. For example, build a parser using parser technology, and use regexps in that /only/ to do gross tokenization ("string of digits", "decimal point", ...): build the rest with a grammar. Regexps are a brittle tool, best tolerated in small doses. For an "NFA" implementation like Python's, you're likely to experience poor speed when combining many complex regexps, and /especially/ when alternatives are ambiguous wrt prefixes (and yours are, else you wouldn't have a problem with "longest match" versus "some shorter match" to begin with). OTOH, under a "DFA" implementation such as POSIX grep's, you're likely to experience exponential memory requirements (DFA implementations can need to build enormous state machines, tracing out in advance all possible paths through all the regexps applied to all possible input strings). Just sounds to me like the wrong tool for the job. -- http://mail.python.org/mailman/listinfo/python-list
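If the full move to a parser is too big a step, a smaller sketch in the same spirit is to stop leaning on '|' for the longest-match decision and make it yourself. The three sub-patterns below are stand-ins for the generated INT/DECIMAL/FRACTION pieces, not anything from the original post.

    import re

    ALTERNATIVES = [re.compile(p) for p in (
        r'\d+\.\d+/\d+\.\d+',    # FRACTION
        r'\d+\.\d+',             # DECIMAL
        r'\d+',                  # INT
    )]

    def longest_alternative(text, pos=0):
        best = None
        for pat in ALTERNATIVES:
            m = pat.match(text, pos)
            if m and (best is None or m.end() > best.end()):
                best = m
        return best

    print longest_alternative("3.14/2.72 rest").group()   # '3.14/2.72', not '3'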
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] > ... > I want to know if there is some way to make Python RE behave like grep > does, Not in general, no. The matching strategies couldn't be more different, and that's both deep and intentional. See Friedl's book for details: http://regex.info/ > or do I have to change to another engine? Yes, if POSIX regexp semantics are what you require. Several years ago there was an extension module for Python supplying POSIX semantics, but I couldn't find anything current just now in a minute of searching. You may be more motivated to search longer ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: GC and security
[Aahz] >> Assuming you're talking about CPython, strings don't really participate >> in garbage collection. Keep in mind that the primary mechanism for >> reaping memory is reference counting, and generally as soon as the >> refcount for an object goes to zero, it gets deleted from memory. [Les Schaffer] > ok so far ... Not really ;-) Refcounting is /a/ "garbage collection" technique, and drawing a distinction between refcount-based reclamation and other ways CPython reclaims memory has no bearing on your question. >> Garbage collection only gets used for objects that refer to other >> objects, so it would only apply if string refcounts are being held by >> GC-able objects. > you lost me by here ... That tends to happen when an irrelevant distinction gets made ;-) > is there something different about string objects than other objects in > Python? Not anything relevant wrt garbage collection. It may be relevant that strings are immutable, since that prevents you from overwriting a string's contents yourself. > are you saying EVERY string in Python stays in memory for the lifetime > of the app? He wasn't, and they don't. >> Also keep in mind, of course, that deleting objects has nothing to do >> with whether the memory gets overwritten... > no chance of being overwritten until deleted, no? True. > and once deleted over time there is some probability of being > overwritten, no? True. Also true if you add the intended "non-zero" qualifier to "probability" ;-) > and i am curious how that works. Purely accidental -- nothing guaranteed -- details can (& do) change across releases. Read obmalloc.c for a tour of the accidents du jour. > it sounds like you are saying once a string, always the same string, in > python. > if thats true, i am glad i know that. Not true, so be glad to forget it. A curious possibility: if you do a debug build of Python, obmalloc.c arranges to overwrite all of an object's memory as soon as the object is reclaimed (by any means, refcounting or otherwise). That wasn't for "security" (faux or otherwise), it's to make it easier to detect buggy C code referencing freed memory. -- http://mail.python.org/mailman/listinfo/python-list
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[/T] >> OTOH, current versions of Python (and Perl) [/F] > just curious, but all this use of (& Perl) mean that the Perl folks have > implemented timsort ? A remarkable case of independent harmonic convergence: http://mail.python.org/pipermail/python-dev/2002-July/026946.html Come to think of it, I don't actually know whether a /released/ Perl ever contained the development code I saw. Looks like it got added to Perl 5.8: http://perldoc.perl.org/sort.html -- http://mail.python.org/mailman/listinfo/python-list
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[Joachim Durchholz] >>>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively >>>>> dominated by the comparison, that could give a speedup of up to 100% >>>>> (approximately - dropping the logN factor is almost irrelevant, what >>>>> counts is losing that factor of 2). [Gabriel Genellina] >>>> In fact it's the other way - losing a factor of 2 is irrelevant, >>>> O(2N)=O(N). The logN factor is crucial here. [Joachim Durchholz] >>> That's just a question of what you're interested in. >>> >>> If it's asymptotic behavior, then the O(logN) factor is a difference. >>> >>> If it's practical speed, a constant factor of 2 is far more relevant >>> than any O(logN) factor. [Tim Peters] >> Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2 >> for every N > 100. Base 2 logarithms are actually most appropriate >> here, and log(N, 2) > 2 for every N > 4. So even if the "2" made >> sense here (it doesn't -- see next paragraph), the log(N) term >> dominates it for even relatively tiny values of N. [Joachim Durchholz] > Whether this argument is relevant depends on the constant factors > associated with each term. I'm afraid you're still missing the point in this example: it's not just that Python's (& Perl's) current sorts do O(N*log(N)) worst-case comparisons, it's that they /do/ N*log(N, 2) worst-case comparisons. O() notation isn't being used, and there is no "constant factor" here: the count of worst-case comparisons made is close to exactly N*log(N, 2), not to some mystery-constant times N*log(N, 2). For example, sorting a randomly permuted array with a million distinct elements will require nearly 1000000*log(1000000, 2) ~= 1000000 * 20 = 20 million comparisons, and DSU will save about 19 million key computations in this case. O() arguments are irrelevant to this, and the Wikipedia page you started from wasn't making an O() argument either: http://en.wikipedia.org/wiki/Schwartzian_transform For an efficient ordinary sort function, the number of invocations of the transform function goes from an average of 2nlogn to n; No O() in sight, and no O() was intended there either. You do exactly N key computations when using DSU, while the hypothetical "efficient ordinary sort function" the author had in mind does about 2*N*log(N, 2) key computations when not using DSU. That's overly pessimistic for Python's & Perl's current sort functions, where no more than N*log(N, 2) key computations are done when not using DSU. The /factor/ of key computations saved is thus as large as N*log(N, 2) / N = log(N, 2). O() behavior has nothing to do with this result, and the factor of log(N, 2) is as real as it gets. If key extraction is at all expensive, and N isn't trivially small, saving a factor of log(N, 2) key extractions is /enormously/ helpful. If this is sinking in now, reread the rest of my last reply that got snipped here. > Roughly speaking, if the constant factor on the O(N) term is 100 and the > constant factor on the O(logN) term is 1, then it's still irrelevant. As above, it's talking about O() that's actually irrelevant in this specific case. > My point actually is this: big-Oh measures are fine for comparing > algorithms in general, but when it comes to optimizing concrete > implementations, its value greatly diminishes: you still have to > investigate the constant factors and all the details that the big-Oh > notation abstracts away. That's true, although the argument here wasn't actually abstracting away anything.
You've been adding abstraction to an argument that didn't have any ;-) > From that point of view, it's irrelevant whether some part of the > algorithm contributes an O(1) or an O(logN) factor: the decision where > to optimize is almost entirely dominated by the constant factors. While true in some cases, it's irrelevant to this specific case. More, in practice a factor of O(log(N)) is almost always more important "for real" than a factor of O(1) anyway -- theoretical algorithms hiding gigantic constant factors in O() notation are very rarely used in real life. For example, the best-known algorithm for finding the number of primes <= x has O(sqrt(x)) time complexity, but AFAIK has /never/ been implemented because the constant factor is believed to be gigantic indeed. Theoretical CompSci is full of results like that, but they have little bearing on practical programming. -- http://mail.python.org/mailman/listinfo/python-list
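A quick sketch that makes the saved factor concrete by counting calls to an (artificially cheap) key function; the exact cmp-based count depends on the input, but the DSU count is always exactly N. Python 2-era spelling, cmp= and all.

    import random

    calls = [0]
    def key(x):
        calls[0] += 1
        return x % 1000

    data = [random.randrange(10**9) for dummy in xrange(10000)]

    calls[0] = 0
    decorated = [(key(x), i, x) for i, x in enumerate(data)]   # DSU: decorate ...
    decorated.sort()                                           # ... sort ...
    result = [x for k, i, x in decorated]                      # ... undecorate
    print "DSU key calls:", calls[0]                           # exactly N == 10000

    calls[0] = 0
    data.sort(cmp=lambda a, b: cmp(key(a), key(b)))            # comparison-based
    print "cmp key calls:", calls[0]                           # roughly 2*N*log2(N)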
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[Joachim Durchholz] >>> Wikipedia says it's going from 2NlogN to N. If a sort is massively >>> dominated by the comparison, that could give a speedup of up to 100% >>> (approximately - dropping the logN factor is almost irrelevant, what >>> counts is losing that factor of 2). [Gabriel Genellina] >> In fact it's the other way - losing a factor of 2 is irrelevant, >> O(2N)=O(N). The logN factor is crucial here. [Joachim Durchholz] > That's just a question of what you're interested in. > > If it's asymptotic behavior, then the O(logN) factor is a difference. > > If it's practical speed, a constant factor of 2 is far more relevant > than any O(logN) factor. Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2 for every N > 100. Base 2 logarithms are actually most appropriate here, and log(N, 2) > 2 for every N > 4. So even if the "2" made sense here (it doesn't -- see next paragraph), the log(N) term dominates it for even relatively tiny values of N. Now it so happens that current versions of Python (and Perl) use merge sorts, where the worst-case number of comparisons is roughly N*log(N, 2), not Wikipedia's 2*N*log(N, 2) (BTW, the Wikipedia article neglected to specify the base of its logarithms, but it's clearly intended to be 2). So that factor of 2 doesn't even exist in current reality: the factor of log(N, 2) is the /whole/ worst-case story here, and, e.g., is near 10 when N is near 1000. A factor of 10 is nothing to sneeze at. OTOH, current versions of Python (and Perl) also happen to use "adaptive" merge sorts, which can do as few as N-1 comparisons in the best case (the input is already sorted, or reverse-sorted). Now I'm not sure why anyone would sort a list already known to be sorted, but if you do, DSU is a waste of time. In any other case, it probably wins, and usually wins big, and solely by saving a factor of (up to) log(N, 2) key computations. > (I'm not on the list, so I won't see responses unless specifically CC'd.) Done. -- http://mail.python.org/mailman/listinfo/python-list
Re: naive misuse? (of PyThreadState_SetAsyncExc)
[EMAIL PROTECTED] >> The documentation for PyThreadState_SetAsyncExc says "To prevent naive >> misuse, you must write your own C extension to call this". Anyone care >> to list a few examples of such naive misuse? [and again] > No? I'll take that then as proof that it's impossible to misuse the > function. That's wise ;-) Stopping a thread asynchronously is in /general/ a dangerous thing to do, and for obvious reasons. For example, perhaps the victim thread is running in a library routine at the time the asynch exception is raised, and getting forcibly ejected from the normal control flow leaves a library-internal mutex locked forever. Or perhaps a catch-all "finally:" clause in the library manages to release the mutex, but leaves the internals in an inconsistent state. Etc. The same kinds of potential disasters account for why Java deprecated its versions of this gimmick: http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html That said, you can invoke PyThreadState_SetAsyncExc() from Python using the `ctypes` module (which will be included in 2.5, and is available as an extension module for earlier Pythons). -- http://mail.python.org/mailman/listinfo/python-list
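For completeness, the ctypes incantation usually looks something like the sketch below (the helper name is invented, and all the caveats above apply in full -- the victim can be interrupted anywhere).

    import ctypes

    def async_raise(tid, exctype):
        # Ask the interpreter to raise `exctype` in the thread whose id is `tid`.
        n = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                                       ctypes.py_object(exctype))
        if n == 0:
            raise ValueError("no thread with id %r" % (tid,))
        elif n > 1:
            # Shouldn't happen; undo the damage if it somehow does.
            ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid), None)
            raise SystemError("asynchronous exception set in %d threads" % n)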
Re: Misleading error message when opening a file (on Windows XP SP 2)
[Claudio Grondi] > Here an example of what I mean > (Python 2.4.2, IDLE 1.1.2, Windows XP SP2, NTFS file system, 80 GByte > large file): > > >>> f = file('veryBigFile.dat','r') > >>> f = file('veryBigFile.dat','r+') > > Traceback (most recent call last): >File "", line 1, in -toplevel- > f = file('veryBigFile.dat','r+') > IOError: [Errno 2] No such file or directory: 'veryBigFile.dat' > > Is it a BUG or a FEATURE? Assuming the file exists and isn't read-only, I bet it's a Windows bug, and that if you open in binary mode ("r+b") instead I bet it goes away (this wouldn't be the first large-file text-mode Windows bug). -- http://mail.python.org/mailman/listinfo/python-list
Re: time.clock() going backwards??
[Giovanni Bajo[ >> I experimented something very strange, a few days ago. I was debugging an >> application at a customer's site, and the problem turned out to be that >> time.clock() was going "backwards", that is it was sometimes >> (randomically) returning a floating point value which was "less than" the >> value returned by the previous invokation. The computer was a pretty fast >> one (P4 3Ghz I think, running Windows XP), and this happened only between >> very close invokations of time.clock(). [Terry Reed] > I seem to remember this being mentioned before on the list, perhaps by Tim > Peters. Perhaps he will chime in. No real need ;-) BIOS or HAL bug on a multiprocessor (or maybe even hyperthreaded) box is by far the most likely cause (and the only cause ever identified for sure by people who followed up). Python's C code slinging QueryPerformanceCounter is exceedingly simple, and isn't a suspect because of that. It's on the edge of vague possibility that Microsoft's compiler generates non-monotonic code for converting 64-bit integer to double: i.e., Python's C code obviously relies on that if i and j are _int64 satisfying i >= j, then (double)i >= (double)j. If that type-conversion code /is/ monotonic (which is almost certainly so), then the only ways the C code can fail are if the HW counter overflows (in which case time.clock() would "jump" a huge amount), or if the sequence of values returned by QueryPerformanceCounter is itself non-monotonic at times (which is consistent with known BIOS/HAL bugs). -- http://mail.python.org/mailman/listinfo/python-list
Re: random writing access to a file in Python
[Claudio Grondi] > I have a 250 Gbyte file (occupies the whole hard drive space) Then where is Python stored ;-)? > and want to change only eight bytes in this file at a given offset of appr. > 200 > Gbyte (all other data in that file should remain unchanged). > > How can I do that in Python? Same way you'd do it in C, really:

    f = open(PATH_TO_FILE, "r+b")
    f.seek(appr. 200 Gbyte)
    f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
    f.close()

This depends on your operating system and file system and platform C library supporting seeks to very large offsets. For example, Windows using NTFS does. Try it. Python should complain (raise an exception during the seek() attempt) if your box refuses to cooperate. Use as recent a released version of Python as you can. -- http://mail.python.org/mailman/listinfo/python-list
Re: pickling and endianess
[Chandrashekhar kaushik] > Can an object pickled and saved on a little-endian machine be unpickled > on a big-endian machine ? Yes. The pickle format is platform-independent (native endianness doesn't matter, and neither do the native sizes of C's various integer types). > Does python handle this in some manner , by say having a flag to indicate > the endianess of the machine on which the object was pickled ? And then > using this flag to decide how that data will be unpickled ? Not a flag, no, the pickle format is the same on all platforms. If you need to know details, read pickle.py. -- http://mail.python.org/mailman/listinfo/python-list
Re: _PyLong_FromByteArray
[Dan Christensen] > My student and I are writing a C extension that produces a large > integer in binary which we'd like to convert to a python long. The > number of bits can be a lot more than 32 or even 64. My student found > the function _PyLong_FromByteArray in longobject.h which is exactly > what we need, but the leading underscore makes me wary. Is it safe to > use this function? Python uses it internally, so it better be ;-) > Will it continue to exist in future versions of python? No guarantees, and that's why it has a leading underscore: it's not an officially supported, externally documented, part of the advertised Python/C API. It so happens that I added that function, because Python needed some form of its functionality internally across different C modules. Making it an official part of the Python/C API would have been a lot more work (which I didn't have time for), and created an eternal new maintenance burden (which I'm not keen on regardless ;-)). In practice, few people touch this part of Python's implementation, so I don't /expect/ it will go away, or even change, for years to come. The biggest insecurity I can think of offhand is that someone may launch a crusade to make some other byte-array <-> long interface "official" based on a different way of representing negative integers. But even then I expect the current unofficial functions to remain, since the 256's-complement representation remains necessary for the `struct` module's "q" format, and for the `pickle` module's protocol=2 long serialization format. > Or is there some other method we should use? No. That's why these functions were invented to begin with ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Inconsistency producing constant for float "infinity"
[Tim Peters] >... >> It has a much better chance of working from .pyc in Python 2.5. >> Michael Hudson put considerable effort into figuring out whether the >> platform uses a recognizable IEEE double storage format, and, if so, >> marshal and pickle take different paths that preserve infinities, >> NaNs, and signed zeroes. [Alex Martelli] > Isn't marshal constrained to work across platforms (for a given Python > release), and pickle also constrainted to work across releases (for a > given protocol)? Yes to both. > I'm curious about how this still allows them to "take different > paths" (yeah, I _could_ study the sources, but I'm lazy:-)... Good questions. Pickle first: pickle, with protocol >= 1, has always had a binary format for Python floats, which is identical to the big-endian IEEE-754 double-precision storage format. This is independent of the native C double representation: even on a non-IEEE box (e.g, VAX or Cray), protocol >= 1 pickle does the best it can to /encode/ native doubles in the big-endian 754 double storage /format/. This is explained in the docs for the "float8" opcode in pickletools.py: The format is unique to Python, and shared with the struct module (format string '>d') "in theory" (the struct and cPickle implementations don't share the code -- they should). It's strongly related to the IEEE-754 double format, and, in normal cases, is in fact identical to the big-endian 754 double format. On other boxes the dynamic range is limited to that of a 754 double, and "add a half and chop" rounding is used to reduce the precision to 53 bits. However, even on a 754 box, infinities, NaNs, and minus zero may not be handled correctly (may not survive roundtrip pickling intact). The problem has been that C89 defines nothing about signed zeroes, infinities, or NaNs, so even on a 754 box there was no consistency across platforms in what C library routines like frexp() returned when fed one of those things. As a result, what Python's "best non-heroic effort" code for constructing a 754 big-endian representation actually did was a platform-dependent accident when fed a 754 special-case value. Likewise for trying to construct a native C double from a 754 representation of a 754 special-case value -- again, there was no guessing what C library routines like ldexp() would return in those cases. Part of what Michael Hudson did for 2.5 is add code to guess whether the native C double format /is/ the big-endian or little-endian 754 double-precision format. If so, protocol >= 1 pickle in 2.5 uses much simpler code to pack and unpack Python floats, simply copying from/to native bytes verbatim (possibly reversing the byte order, depending on platform endianness). Python doesn't even try to guess whether a C double is "normal", or an inf, NaN, or signed zero then, so can't screw that up -- it just copies the bits blindly. That's much better on IEEE-754 boxes, although I bet it still has subtle problems. For example, IIRC, 754 doesn't wholly define the difference in storage formats for signaling NaNs versus quiet NaNs, so I bet it's still theoretically possible to pickle a signaling NaN on one 754 box and get back a quiet NaN (or vice versa) when unpickled on a different 754 box. Protocol 0 (formerly known as "text mode") pickles are still a crap shoot for 754 special values, since there's still no consistency across platforms in what the C string<->double routines produce or accept for special values. Now on to marshal. 
Before 2.5, marshal only had a "text mode" storage format for Python floats, much like protocol=0 pickle. So, as for pickle protocol 0, what marshal produced or reconstructed for a 754 special value was a platform-dependent accident. Michael added a binary marshal format for Python floats in 2.5, which uses the same code protocol >= 1 pickle uses for serializing and unserializing Python floats (except that the marshal format is little-endian instead of big-endian). These all go thru floatobject.c's _PyFloat_Pack8 and _PyFloat_Unpack8 now, and a quick glance at those will show that they take different paths according to whether Michael's native-format-guessing code decided that the native format was ieee_big_endian_format, ieee_little_endian_format, or unknown_format. The long-winded pre-2.5 pack/unpack code is only used in the unknown_format case now. -- http://mail.python.org/mailman/listinfo/python-list
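The '>d' correspondence is easy to check from Python (a sketch; the exact pickle bytes shown assume protocol 1 and an IEEE-754 box).

    import pickle, struct

    x = 1.5
    packed = struct.pack('>d', x)       # big-endian IEEE-754 double
    print repr(packed)                  # '?\xf8\x00\x00\x00\x00\x00\x00'
    print repr(pickle.dumps(x, 1))      # 'G' (BINFLOAT) + the same 8 bytes + '.'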
Re: Inconsistency producing constant for float "infinity"
[Peter Hansen] >> I'm investigating a puzzling problem involving an attempt to >> generate a constant containing an (IEEE 754) "infinity" value. (I >> understand that special float values are a "platform-dependent >> accident" etc...) [also Peter] > ... > My guess about marshal was correct. Yup. > The problem (value becoming 1.0) appears when running from .pyc > files. Immediately after the source code is changed, the code works, > since it doesn't unmarshal the .pyc file but just works from the > bytecode freshly compiled in memory. > > This demonstrates what would be the heart of the problem, which I guess > means this is not surprising to almost anyone, but perhaps will be a > wakeup call to anyone who might still be unaware and has code that > relies on constants like 1e666 producing infinities: It has a much better chance of working from .pyc in Python 2.5. Michael Hudson put considerable effort into figuring out whether the platform uses a recognizable IEEE double storage format, and, if so, marshal and pickle take different paths that preserve infinities, NaNs, and signed zeroes. For Pythons earlier than that, it's a better idea to avoid trying to express special values as literals. That not only relies on platform accidents about how marshal works, but also on platform accidents about how the C string->float library works. For example, do instead:

    inf = 1e300 * 1e300
    nan = inf - inf

If you're /on/ an IEEE-754 platform, those will produce an infinity and a NaN, from source or from .pyc.

> >>> import marshal
> >>> marshal.dumps(1e666)
> 'f\x061.#INF'
> >>> marshal.loads(marshal.dumps(1e666))
> 1.0

Illustrating the remarkable truth that the Microsoft C string->float routines can't read what the MS float->string routines produce for special values (try reading "1.#INF" from C code with a double format and you'll also get 1.0 back). Here in Python 2.5, on Windows (note that marshal grew a new binary float format, primarily to support this):

>>> import marshal
>>> marshal.dumps(1e666)
'g\x00\x00\x00\x00\x00\x00\xf0\x7f'
>>> marshal.loads(_)
1.#INF
>>> float.__getformat__('double')   # also new in 2.5
'IEEE, little-endian'

-- http://mail.python.org/mailman/listinfo/python-list
Re: datetime to timestamp
[Simen Haugen] >>> How can I convert a python datetime to a timestamp? It's easy to convert >>> a timestamp to datetime (datetime.datetime.fromtimestamp(), but the >>> other way around...?) [John Machin] >> Is the timetuple() method what you want?
>>
>> #>>> import datetime
>> #>>> n = datetime.datetime.now()
>> #>>> n
>> datetime.datetime(2006, 8, 11, 23, 32, 43, 109000)
>> #>>> n.timetuple()
>> (2006, 8, 11, 23, 32, 43, 4, 223, -1)

[also John] > Aaaarrrggghhh no it's not what you want Yes, it is ;-) > -- looks like you have to do the arithmetic yourself, starting with > toordinal() It's just one step from the time tuple:

    import time
    time.mktime(some_datetime_object.timetuple())

The datetime module is intended to be an island of relative sanity. Because the range of dates "timestamps" can represent varies across platforms (and even "the epoch" varies), datetime doesn't even try to produce timestamps directly -- datetime is more of an alternative to "seconds from the epoch" schemes. Because datetime objects have greater range and precision than timestamps, conversion is problem-free in only one direction. It's not a coincidence that that's the only direction datetime supplies ;-) -- http://mail.python.org/mailman/listinfo/python-list
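A round-trip sketch of that recipe; the microseconds are the only casualty, since timetuple() has no field for them.

    import datetime, time

    dt = datetime.datetime.now()
    stamp = time.mktime(dt.timetuple())              # local-time seconds since the epoch
    print dt
    print datetime.datetime.fromtimestamp(stamp)     # same, minus the microseconds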
Re: threading.Event usage causing intermitent exception
[EMAIL PROTECTED] > Admittedly this problem causes no actual functional issues aside from > an occasional error message when the program exits. The error is: > > Unhandled exception in thread started by > Error in sys.excepthook: > Original exception was: > > Yes all that info is blank. That's typical when the interpreter has torn so much of itself down that there's not enough left in the `sys` module even to print exception info gracefully. The easiest way to stop that is to stop /trying/ to run Python code while the interpreter is tearing itself down. > The application is a console application that is waiting for some > condition on the machine to happen. However, I leave open the > possiblitiy to cancel by a single key press at which > point the program terminates. Suffice it to say, I cannot perform both > checks without invoking threads as the key press gets "missed" > sometimes. Below is a simplification of the code > > canceled = False > myEvent = threading.Event() > > def watchForCancel() > global canceled > # turn of terminal buffering and capture key presses here > canceled = True > myEvent.set() Presumably this is some kind of poll-and-sleep loop? If so, add a check to get out of the loop if myEvent.isSet(). Or if not, make it some kind of poll-and-sleep loop ;-) > def watchForCondition() > # do a bunch of stuff checking the system > myEvent.set() Ditto. > cancelThread = threading.Thread(target = watchForCancel) > cancelThread.setDaemon(True) # so I can exit the program when I want to And get rid of that. The comment doesn't make sense to me, and forcing a thread to be a daemon is exactly what /allows/ the thread to keep running while the interpreter is tearing itself down. That's why "daemonism" isn't the default: it's at best delicate. I don't see a real reason for wanting this here. > cancelThread.start() > conditionThread = threading.Thread(target = watchForCondition) > conditionThread.setDaemon(True) Ditto. > conditionThread.start() > > myEvent.wait() > > if cancelled: > sys.exit(2) > > # do more stuff if the condition returned instead of cancel and then > I'm done > > > I've left out most of the active code, just cuz I think it muddies the > water. Now about 9 out of 10 times this works just fine. However, > every once in a while I get the exceptions mentioned above, but only > when I cancel out of the operation. I think the conditionThread is in > the process of shutting down and gets hosed up somehow and spits out an > exception, but the interpreter no longer has access to the info since > it is shutting down. At this point it's likely that even sys.stdout and sys.stderr no longer exist. The "Unhandled exception" message is printed directly to the C-level `stderr` instead. > ... > I suppose I could make the threads aware of each other, but that just > seems stupid. Any suggestions on how to eliminate this intermittent > error? Stop forcing them to be daemon threads. The interpreter won't start to tear itself down then before both threads terminate on their own. To arrange for that, it's not necessary for the threads to become aware of each other, but it is necessary for the threads to become aware of another (shared) reason /for/ exiting. The most natural way to do that, given what you said above, is to make both threads aware that their shared myEvent event may get set "externally", and to stop when they find it has been set. -- http://mail.python.org/mailman/listinfo/python-list
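A sketch of the restructuring described above, with the platform-specific checks replaced by invented placeholder functions (key_was_pressed, condition_holds): both workers watch the shared event, so they exit on their own and neither needs to be a daemon thread.

    import threading, time

    canceled = False
    done = threading.Event()

    def watch_for_cancel():
        global canceled
        while not done.isSet():
            if key_was_pressed():         # placeholder for the real key check
                canceled = True
                done.set()
                return
            time.sleep(0.1)

    def watch_for_condition():
        while not done.isSet():
            if condition_holds():         # placeholder for the real system check
                done.set()
                return
            time.sleep(0.1)

    t1 = threading.Thread(target=watch_for_cancel)
    t2 = threading.Thread(target=watch_for_condition)
    t1.start(); t2.start()
    done.wait()
    t1.join(); t2.join()    # both loops notice `done` is set and return promptly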
Re: Initializing the number of slots in a dictionary
... [Jon Smirl] > I know in advance how many items will be added to the dictionary. Most > dictionary implementations I have previously worked with are more > efficient if they know ahead of time how big to make their tables. Richard Jones spent considerable time investigating whether "pre-sizing" lists and dicts in CPython could help, at the "Need For Speed" sprint earlier this year. He didn't find a win worth getting; e.g., read the section "List pre-allocation" at: http://wiki.python.org/moin/NeedForSpeed/Failures Trying it for dicts was also part of what he did, but I don't think specifics about that were recorded on the Wiki. I was at the sprint, and hoots of triumph from Richard's direction were conspicuous by absence during his dict time ;-) > In this case I only need to do a presence test of the key, there is no > actual data associated with the key. The table is used for detecting > duplicate entries. Is there a more efficient to do this test that sticking > an empty string into a dict? The keys are sha1.digest(). It's probably more common to set the value to None or 1, but it doesn't really matter in reality. In theory, using 1 or an empty string relies on the implementation accident that CPython stores those uniquely, while it's guaranteed that None is a singleton object. BTW, is there a reason to use SHA instead of MD5? I ask because the latter is 4 bytes shorter, and you apparently have a /lot/ of keys. If you're using a current version of Python, you can save some memory by using a builtin set object instead. The CPython implementation of sets is very much like its implementation of dicts, but doesn't consume any memory for the (non-existent) associated values. You also get a number of operations useful on sets (like intersection and union). > ... > Since I am rerunning the conversion over and over anything simple that speeds > it up is helpful. > > I already have 4GB RAM, it would take around 30GB to get everything in memory. > > Dictionaries are not a big problem for me, but there are many in use and > they have millions of items in them. A peculiar suggestion: if you don't need to release system resources cleanly at the end of a run, try doing: import os os._exit(0) at the end. If you have dicts with millions of entries swapped to disk, it /can/ consume major time just to page them all in again to decrement the key & value refcounts if you let "clean shutdown" code determine they've become trash. Bailing ungracefully can skip all that work (but also skips other work, like letting the platform C I/O library close still-open files gracefully). -- http://mail.python.org/mailman/listinfo/python-list
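A sketch of the set-based duplicate check (md5 shown per the aside about key size; swap sha back in if the stronger hash matters to you).

    import md5

    seen = set()

    def is_duplicate(data):
        digest = md5.new(data).digest()   # 16 bytes per key instead of sha's 20
        if digest in seen:
            return True
        seen.add(digest)
        return False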
Re: random shuffles
[Boris Borcic] > x.sort(cmp = lambda x,y : cmp(random.random(),0.5)) > > pick a random shuffle of x with uniform distribution ? Say len(x) == N. With Python's current sort, the conjecture is true if and only if N <= 2. > Intuitively, assuming list.sort() does a minimal number of comparisons to > achieve the sort, No idea what that could mean in a rigorous, algorithm-independent way. > I'd say the answer is yes. But I don't feel quite comfortable > with the intuition... can anyone think of a more solid argumentation ? If a list is already sorted, or reverse-sorted, the minimum number of comparisons needed to determine that is N-1, and Python's current sort achieves that. It first looks for the longest ascending or descending run. If comparison outcomes are random, it will decide "yes, already sorted" with probability 1/2**(N-1), and likewise for reverse-sorted. When N > 2, those are larger than the "correct" probabilities 1/(N!). So, e.g., when N=3, the sort will leave the list alone 1/4th of the time (and will reverse the list in-place another 1/4th). That makes the identity and reversal permutations much more likely than "they should be", and the bias is worse as N increases. Of course random.shuffle(x) is the intended way to effect a permutation uniformly at random. -- http://mail.python.org/mailman/listinfo/python-list
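An empirical sketch for N == 3 bears this out (counts will wobble from run to run):

    import random

    counts = {}
    for dummy in xrange(60000):
        x = [0, 1, 2]
        x.sort(cmp=lambda a, b: cmp(random.random(), 0.5))
        t = tuple(x)
        counts[t] = counts.get(t, 0) + 1

    for perm in sorted(counts):
        print perm, round(counts[perm] / 60000.0, 3)
    # The identity (0, 1, 2) and the reversal (2, 1, 0) each show up far more
    # often than the uniform 1/6; random.shuffle(x) has no such bias.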
Re: hash() yields different results for different platforms
[Grant Edwards] >> ... >> The low 32 bits match, so perhaps you should just use that >> portion of the returned hash?
>>
>> >>> hex(12416037344)
>> '0x2E40DB1E0L'
>> >>> hex(-468864544 & 0xFFFFFFFF)
>> '0xE40DB1E0L'
>>
>> >>> hex(12416037344 & 0xFFFFFFFF)
>> '0xE40DB1E0L'
>> >>> hex(-468864544 & 0xFFFFFFFF)
>> '0xE40DB1E0L'

[Qiangning Hong] > Is this relationship (same low 32 bits) guaranteed? No. Nothing about hashes is guaranteed, except that when x and y are of a hashable type, and x == y, then hash(x) == hash(y) too. > Will it change in the future version? That's possible, but not planned. Note that the guts of string hashing in CPython today is implemented via

    while (--len >= 0)
        x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what that does (behavior when signed multiplication overflows isn't defined in C). -- http://mail.python.org/mailman/listinfo/python-list
Re: Valgrind memory-checker reports memory problems in Python
[Nathan Bates] > Are the Python developers running Python under Valgrind? Please read Misc/README.valgrind (in your Python distribution). -- http://mail.python.org/mailman/listinfo/python-list
Re: how to stop python...
[bruce] > perl has the concept of "die". does python have anything similar. how can a > python app be stopped? > > the docs refer to a sys.stop. Python docs? Doubt it ;-) > but i can't find anything else... am i missing something...

>>> import sys
>>> print sys.exit.__doc__
exit([status])

Exit the interpreter by raising SystemExit(status).
If the status is omitted or None, it defaults to zero (i.e., success).
If the status is numeric, it will be used as the system exit status.
If it is another kind of object, it will be printed and the system
exit status will be one (i.e., failure).

Of course there's nothing to stop you from catching SystemExit either -- it's just another exception, but one that happens to shut down the interpreter in the specified way if it isn't caught & handled. -- http://mail.python.org/mailman/listinfo/python-list
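And, to illustrate that last point, catching it is nothing special (Python 2 except syntax):

    import sys

    try:
        sys.exit(2)
    except SystemExit, e:
        print "caught SystemExit; status was", e.code   # prints 2; the interpreter keeps running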
Re: Questions on Threading
[j.c.sackett] > I'm using the threading module to accomplish some distributed processing on > a project, and have a basic (I hope) question that I can't find an answer to > elsewhere. > > I've noted that there's a lot of documentation saying that there is no > external way to stop a thread, True. > and yet when creating a thread through the threading module I can see > that it has a method called _Thread__stop, which does indeed stop it. You're misreading the code then. __stop() is used internally by a thread T to record when T itself knows it's _about_ to stop. Other threads may be blocked in T.join() calls, and the purpose of T calling T.__stop() is to notify such threads they can proceed. > Is this in any way dangerous ( i.e. doesn't actually stop the thread, That's right. > or causes other thread timing issues down the road?) If you call T.__stop(), T will keep running (exactly as if you hadn't called it), but threads waiting on T.join() will think T has stopped. > Or is there some other reason that there is officially no stop method? Python never offered it primarily for the same reasons Java regretted offering it (so much so that Java eventually deprecated it): http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Queue.Queue.queue.clear() thread-safe?
[Russell Warren] >>> I'm guessing no, since it skips down through any Lock semantics, [Tim Peters] >> Good guess :-) It's also unsafe because some internal conditions must >> be notified whenever the queue becomes empty (else you risk deadlock). [Fredrik Lundh] > "also" ? if it weren't for the other things, the clear() call itself > would have been atomic enough, right ? Define "other things" :-) For example, Queue.get() relies on that when self._empty() is false, it can do self._get() successfully. That's 100% reliable now because the self.not_empty condition is in the acquired state across both operations. If some yahoo ignores the underlying mutex and clears the queue after self._empty() returns true but before pop() executes self._get(), the call to the latter will raise an exception. That kind of failure is what I considered to be an instance of a problem due to the OP's "skips down through any Lock semantics"; the failure to notify internal conditions that the queue is empty is a different _kind_ of problem, hence my "also" above. If you're just asking whether deque.clear() is "atomic enough" on its own, define "enough" :-) It has to decref each item in the deque, and that can end up executing arbitrary Python code (due to __del__ methods or weakref callbacks), and that in turn can allow the GIL to be released allowing all other threads to run, and any of that Python-level code may even mutate the deque _while_ it's being cleared. The C code in deque_clear() looks threadsafe in the sense that it won't blow up regardless -- and that's about the strongest that can ever be said for a clear() method. dict.clear() is actually a bit stronger, acting as if a snapshot were taken of the dict's state at the instant dict.clear() is called. Any mutations made to the dict as a result of decref side effects while the original state is getting cleared survive. In contrast, if a new item is added to a deque d (via decref side effect) while d.clear() is executing, it won't survive. That's "atomic enough" for me, but it is kinda fuzzy. -- http://mail.python.org/mailman/listinfo/python-list
Re: Is Queue.Queue.queue.clear() thread-safe?
[Russell Warren]
> I'm guessing no, since it skips down through any Lock semantics,

Good guess :-)  It's also unsafe because some internal conditions must be
notified whenever the queue becomes empty (else you risk deadlock).

> but I'm wondering what the best way to clear a Queue is then.
>
> Essentially I want to do a "get all" and ignore what pops out, but I
> don't want to loop through a .get until empty because that could
> potentially end up racing another thread that is more-or-less blindly
> filling it asynchronously.
>
> Worst case I think I can just borrow the locking logic from Queue.get
> and clear the deque inside this logic, but would prefer to not have to
> write wrapper code that uses mechanisms inside the object that might
> change in the future.

There's simply no defined way to do this now.

> Also - I can of course come up with some surrounding architecture to
> avoid this concern altogether, but a thread-safe Queue clear would do
> the trick and be a nice and short path to solution.
>
> If QueueInstance.queue.clear() isn't thread safe... what would be the
> best way to do it?  Also, if not, why is the queue deque not called
> _queue to warn us away from it?

"Consenting adults" -- if you want an operation that isn't supplied out of
the box, and are willing to take the risk of breaking into the internals,
Python isn't going to stop you from doing whatever you like.  "mutex" isn't
named "_mutex" for the same reason.  A

    q.mutex.acquire()
    try:
        q.queue.clear()
        q.unfinished_tasks = 0
        q.not_full.notify()
        q.all_tasks_done.notifyAll()
    finally:
        q.mutex.release()

sequence probably works (caveat emptor).  BTW, it may be _possible_ you could
use the newer task_done()/join() Queue gimmicks in some way to get what
you're after (I'm not really clear on that, exactly).

Anyway, yes, there is cross-release risk in breaking into the internals like
that.  That's an "adult decision" for you to make.  The only way to get
permanent relief is to re-think your approach, or propose adding a new
Queue.clear() method and Queue._clear() default implementation.  I don't know
whether that would be accepted -- it seems simple enough, although since
you're the first person to ask for it in 15 years :-), you won't get much
traction arguing that's a critical lack, and every new bit of cruft becomes
an ongoing burden too.

-- http://mail.python.org/mailman/listinfo/python-list
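For anyone adapting that sequence to a current Python 3 queue.Queue, a rough
equivalent is sketched below (notifyAll() is now notify_all(), and the mutex
works as a context manager).  This still reaches into undocumented internals,
so the same caveat emptor applies:

    import queue

    def clear_queue(q):
        # Unsupported: relies on Queue internals (mutex, queue, conditions),
        # mirroring the sequence shown above.
        with q.mutex:
            q.queue.clear()
            q.unfinished_tasks = 0
            q.not_full.notify()
            q.all_tasks_done.notify_all()

    q = queue.Queue()
    for i in range(5):
        q.put(i)
    clear_queue(q)
    print(q.qsize())   # 0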
Re: Iteration over recursion?
[MTD <[EMAIL PROTECTED]>]
> I've been testing my recursive function against your iterative
> function, and yours is generally a quite steady 50% faster on
> factorizing 2**n +/- 1 for 0 < n < 60.

If you're still not skipping multiples of 3, that should account for most of
it.

> I think that, for a challenge, I'll try to make a recursive function that
> matches or beats the iterative function -- it's worth the experiment!

Don't stop on that one before you succeed ;-)  Provided you make no more than
one recursive call per factor, you can't make more than log(n, 2) recursive
calls in total, and _typically_ far fewer than that.  IOW, the number of
recursive calls should generally be trivial, and factoring 2**n will be the
worst case.

-- http://mail.python.org/mailman/listinfo/python-list
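As an illustration of that bound (this sketch is not from the thread): a
recursion that makes at most one call per factor found, and passes the
current trial divisor down so the search never restarts from 2, can recurse
at most log(n, 2) times, because every factor is at least 2.

    def rfactor(n, d=2):
        # At most one recursive call per factor found; the divisor d is
        # passed along so each level resumes where the previous one stopped.
        if n <= 1:
            return []
        while d * d <= n:
            if n % d == 0:
                return [d] + rfactor(n // d, d)   # one call per factor
            d += 1
        return [n]   # what's left is prime

    print(rfactor(2 ** 20))        # worst case: 20 factors of 2
    print(rfactor(600851475143))   # 4 factors -> at most 4 recursive calls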
Re: random.jumpahead: How to jump ahead exactly N steps?
[Matthew Wilson]
> The random.jumpahead documentation says this:
>
> Changed in version 2.3: Instead of jumping to a specific state, n steps
> ahead, jumpahead(n) jumps to another state likely to be separated by
> many steps..
>
> I really want a way to get to the Nth value in a random series started
> with a particular seed.  Is there any way to quickly do what jumpahead
> apparently used to do?

No known way, and it seems unlikely that any quick way will be found in my
lifetime ;-) for the Mersenne Twister.  In Pythons <= 2.2, the underlying
generator was the algebraically _very_ much simpler original Wichmann-Hill
generator, and jumping ahead by exactly n steps was just a matter of a few
modular integer exponentiations.  That took time proportional to log(n), so
was extremely fast.

It was also much more _necessary_ using W-H, since W-H's period was only
around 10**13, while the Twister's period is around 10**6001:  even if you're
going to use a billion random numbers per sequence, the Twister's period has
a way-way-beyond astronomical number of non-overlapping subsequences of that
length.  The chance of hitting an overlap is correspondingly minuscule.

> I devised this function, but I suspect it runs really slowly:

Well, it takes exactly as much time as it takes to call random() n times.

> def trudgeforward(n):
>     '''Advance the random generator's state by n calls.'''
>     for _ in xrange(n): random.random()
>
> So any speed tips would be very appreciated.

What are you trying to achieve in the end?  Take it as a fact that there is
no known way to advance the Twister by n states faster than what you've
already found.

-- http://mail.python.org/mailman/listinfo/python-list
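When the real goal is several reproducible, independent streams (the usual
reason for wanting jumpahead), a common approach today -- offered only as a
sketch, not as advice from the thread -- is one independently seeded
generator per stream, relying on the overlap argument made above:

    import random

    # One generator per logical stream.  Given the Twister's ~10**6001
    # period, distinct seeded streams overlapping within any realistic run
    # length is not a practical concern.
    streams = [random.Random(seed) for seed in (101, 102, 103)]

    for i, rng in enumerate(streams):
        print("stream", i, [round(rng.random(), 4) for _ in range(3)])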
Re: Iteration over recursion?
[Kay Schluehr]
> You might use a separate prime generator to produce prime factors.  The
> factorize algorithm becomes quite simple and configurable by prime
> generators.

Alas, yours was _so_ simple that it always takes time proportional to the
largest prime factor of n (which may be n) instead of worst-case time
proportional to sqrt(n).  Repair that, and it's not quite so simple anymore.

The speed of the sieve generator would also benefit a lot by never looking at
even integers, beyond an initial "yield 2" ... the OP said he was looking for
speed more than elegance, and they're not good buddies here ;-)

> def eratosthenes():
>     memo = {}
>     q = 2
>     while True:
>         p = memo.pop(q, None)
>         if p is None:
>             yield q
>             memo[q*q] = q
>         else:
>             x = p + q
>             while x in memo:
>                 x += p
>             memo[x] = p
>         q+=1
>
> def factorize(n, sieve = eratosthenes):
>     if n <= 1:
>         return [n]
>     factors = []
>     primes = sieve()
>     for q in primes:
>         while n % q == 0:
>             factors.append(q)
>             n //= q
>         if n == 1:
>             return factors

At _some_ point you might think that's bound to be faster than only skipping
multiples of 2 and 3, because it does fewer divisions.  But to get to a
particular prime p, the sieve there has to go around its outer loop about 3x
as often as the trial() function goes around its inner loop, and grows a dict
with about p/ln(p) entries (while the trial() function's memory use is
constant), and those aren't free.

Applied to 991**2 (constructed so that both functions take time proportional
to sqrt(n)), on my box the factorize() function took 4x longer than trial()
(approximately 16 seconds versus 4).  Trial division isn't practical for such
large inputs, but I picked the square of a "large" prime to give factorize()
maximum advantage in skipping division attempts (by the time we get to a
prime p, factorize() tries about p/ln(p) divisions, while trial() does about
p/3; so trial() does about ln(p)/3 times as many divisions as factorize():
the larger the p we need, the larger that _factor_ gets).  Alas, it appears
that made the dict so big that cache faults on dict accesses more than wiped
out the division advantage.  At the smaller 83**2, factorize() took only
about 3.6x longer, despite losing some of its relative division advantage.

In short, it's never what you think it is ;-)

-- http://mail.python.org/mailman/listinfo/python-list
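The "repair" mentioned above -- stop drawing primes once the next prime
exceeds the square root of what remains, at which point the remainder must be
prime -- might look like the following sketch (a reconstruction for
illustration, not code from the thread):

    def eratosthenes():
        # Incremental sieve, as quoted above.
        memo = {}
        q = 2
        while True:
            p = memo.pop(q, None)
            if p is None:
                yield q
                memo[q * q] = q
            else:
                x = p + q
                while x in memo:
                    x += p
                memo[x] = p
            q += 1

    def factorize(n, sieve=eratosthenes):
        if n <= 1:
            return [n]
        factors = []
        for q in sieve():
            if q * q > n:    # the repair: no prime <= sqrt(n) divides n,
                break        # so whatever is left must be prime (or 1)
            while n % q == 0:
                factors.append(q)
                n //= q
        if n > 1:
            factors.append(n)
        return factors

    print(factorize(991 ** 2))   # [991, 991], after sieving only up to ~991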
Re: Python SHA-1 as a method for unique file identification ? [help!]
[EP <[EMAIL PROTECTED]>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script

It's your script.  Always open binary files in binary mode.  It's a disaster
on Windows if you don't (if you open a file in text mode on Windows, the OS
pretends that EOF occurs at the first instance of byte chr(26) -- this is an
ancient Windows behavior that made an odd kind of sense in the mists of
history, and has persisted in worship of Backward Compatibility despite that
the original reason for it went away _long_ ago).

> ...
> I am trying to reduce duplicate files in storage at home - I have a
> large number of files (e.g. MP3s) which have been stored on disk multiple
> times under different names or on different paths.
> ...
> All seemed to be working until I examined my log files and found files
> with the same SHA digest had different sizes according to
> os.stat(fpath).st_size .  This is on Windows XP.
>
> - Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

> - Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted ;-)

> ...
> def hashit(pth):
>     """Takes a file path and returns a SHA hash of its string"""
>     fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.

Do note that there are faster ways to go about this.  For example, if a
particular file has a unique size among the files you're looking at, there's
no reason to even open that file (let alone compute its hash).  Group the
files by size, and within a size group you can find duplicates by, e.g.,
first comparing their initial 16 bytes, then the next 32, then the next 64
... etc.  If duplicates are uncommon, this can be a huge savings.  For
example, here's output from an old dup-finding Python program of mine run
over a directory tree which happens to contain no duplicates:

    Files
        Total                10,718
        Unique               10,718
        Duplicate                 0
        # w/ unique size     10,053
        # w/ unique prefix      665

    Bytes
        Total         1,401,668,015
        Unique        1,401,668,015
        Duplicate                 0
        Read                 76,688
        Excess               76,688

Those last two lines mean it actually read a total of only about 77,000 file
bytes on its way to proving there were no duplicates among about 11,000 files
spanning about 1.4 gigabytes.

No hashes are computed by that program.  I'd use a hash if instead I were
going to save a persistent (across runs) set of known hashes, so that
answering "here's a new file -- same as one I already have?" could be done by
computing its hash.  While the program above generally reads "amazingly" few
file bytes, it hammers the OS file directory services, and it would go faster
to compute a hash of a new file than to run the whole analysis again.

-- http://mail.python.org/mailman/listinfo/python-list
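A small sketch of the size-first strategy described above, written for a
current Python 3 (hashlib replaces the old sha module; the root directory
argument is a placeholder).  Only files whose sizes collide are ever opened,
and they are always opened in binary mode:

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        # Pass 1: group by size.  A file whose size is unique can't be a
        # duplicate, so it is never even opened.
        by_size = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        # Pass 2: hash only files that share a size, always in binary mode.
        by_digest = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            for path in paths:
                with open(path, "rb") as f:   # 'rb', never 'r'
                    digest = hashlib.sha1(f.read()).hexdigest()
                by_digest[digest].append(path)
        return {d: p for d, p in by_digest.items() if len(p) > 1}

    for digest, paths in find_duplicates(".").items():
        print(digest, paths)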
Re: Iteration over recursion?
[MTD]
> I've been messing about for fun creating a trial division factorizing
> function and I'm naturally interested in optimising it as much as
> possible.
>
> I've been told that iteration in python is generally more
> time-efficient than recursion.  Is that true?

Since you heard it from me to begin with, I suppose saying "yes" again won't
really help ;-)  You can use time.time() or time.clock(), or the `timeit`
module, to measure speed.  Try timing a dead-simple factorial function both
ways.

BTW, I've worked on Python's implementation for close to 15 years now, so
when I say something about Python, it's _usually_ safe to just believe it :-)
Questioning is fine, but in cases like this where it's easy to find out for
yourself, I probably won't make time to respond.

> Here is my algorithm as it stands.  Any suggestions appreciated!
>
> from math import *
>
> def factorize(x):
>     """ Return factors of x """
>     factors = []
>     lim = int(sqrt(x))

Here you're limited to integers for which a floating sqrt is accurate.
That's 53 bits on most platforms.  For integers larger than that, the code
may produce incorrect results in mysterious ways at times (because the square
root may be inaccurate).

>     y = divmod(x,2)

Most people would use "unpacking assignment" here instead, like

    q, r = divmod(x, 2)

Then later code gets to use meaningful names instead of messing with
mysterious little integer subscripts.  It's also quicker to use local names
than to build subscripting expressions.

>     if y[1] == 0: # x is even
>         factors = factors + [2] + factorize(y[0])

Since `factors` upon entry to this block must be [], it's kinda contorted.
More obvious (if you _have_ to use recursion) would be:

    return [2] + factorize(y[0])

and then unindent the rest of the code a level.

>     else: # x is odd
>         i = 3
>         while i <= lim:
>             y = divmod(x,i)

Again more commonly written:

    q, r = divmod(x, i)

>             if y[1] == 0:
>                 factors = factors + [i] + factorize(y[0])

It hardly matters in this context, but appending to a list is _generally_
much faster than building a brand new list from scratch.

At a higher level, your recursions always start from 2 again; e.g., if i
happens to be 434349 here, the factorize(y[0]) call starts over from 2,
despite that you already know that no divisor less than 434349 can succeed.
To do this _efficiently_ with recursion requires passing more knowledge
across recursion levels.

>                 i = lim+1
>             else:
>                 i += 2
>
>     if factors == []: # necessary step for correct recursion
>         factors = [x]
>
>     return factors

Recursion is pretty bizarre here.  Not _wrong_, just "pretty bizarre", and is
leading you into higher-level inefficiencies.  There are all sorts of ways to
repair that, but I'm not going to talk about them because it's easy to write
an iterative algorithm instead.

Here's one way to do that, avoiding float sqrt (this works correctly for
integers of any size, although trial division is hopelessly _inefficient_ for
finding "large" factors), and skipping multiples of 3 in the inner loop too:

    def trial(n):
        if n <= 1:
            return [n]
        factors = []
        for tiny in 2, 3:
            while n % tiny == 0:
                factors.append(tiny)
                n //= tiny
        delta = 2
        d = 5
        while 1:
            q, r = divmod(n, d)
            if r:
                if q < d:
                    break
                d += delta
                delta = 6 - delta
            else:
                factors.append(d)
                n = q
        if n > 1:
            factors.append(n)
        return factors

If we call the original input N, the key invariants holding at the top of the
loop are:

1. N >= 2, and N == n * the product of the integers in `factors`
2. n is not divisible by any prime < d
3. all elements in `factors` are prime

It may take a bit of thought to see that d marches over all integers >= 5
that aren't divisible by either 2 or 3:  5, 7, 11, 13, 17, 19, 23, 25, 29,
31, ...

It's more-or-less generally true that establishing and proving loop
invariants in an iterative algorithm is "the same thing" as establishing and
proving pre- and post-conditions in a recursive algorithm.  While it may feel
strained at first, it's very much worthwhile learning how to do that.  For
example, from #2 it's easy to prove that if q < d, n must either be 1 or a
prime.  That's what justifies the "break" statement inside the loop.  Then
since that's the only way to get out of the loop, n is either 1 or a prime
when the loop finishes, and that explains the rest of the code.

-- http://mail.python.org/mailman/listinfo/python-list
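As a quick sanity check of invariant #1 (the factors always multiply back to
the original input), illustrative usage of the trial() function shown above
might look like this on a current Python (math.prod needs 3.8+):

    from math import prod   # Python 3.8+

    for n in (2, 360, 83 ** 2, 991 ** 2, 2 ** 60 - 1):
        factors = trial(n)
        assert prod(factors) == n    # invariant #1, checked after loop exit
        print(n, "->", factors)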
Re: need helping tracking down weird bug in cPickle
[Carl J. Van Arsdall]
> Hey everyone, cPickle is raising an ImportError that I just don't quite
> understand.

When that happens, the overwhelmingly most likely cause is that the set of
modules on your PYTHONPATH has changed since the pickle was first created, in
ways such that a module _referenced_ inside the pickle can no longer be
found.

For example, here's file resourcepmdb.py on my box:

    def f():
        print "hi"

Here's a bit of Python code that uses it:

    import cPickle
    import resourcepmdb

    pf = open("junk.pik", "wb")
    pickler = cPickle.Pickler(pf)
    pickler.dump([resourcepmdb.f])
    pf.close()

Pickle stores module globals like user-defined classes and module functions
(the case above) not by saving the code they contain, but _just_ by saving
strings recording the module's name and the global's name.  The unpickling
process creates an import out of those strings.

So long as resourcepmdb stays on my PYTHONPATH, unpickling junk.pik works
fine.  For example, this code:

    import cPickle

    pf = open("junk.pik", "rb")
    pickler = cPickle.Unpickler(pf)
    whatever = pickler.load()
    pf.close()
    print whatever

prints something like

    [<function f at 0x...>]

But if resourcepmdb gets off my path (for example, resourcepmdb.py gets
deleted, renamed, or moved -- or PYTHONPATH itself changes), that same code
dies with exactly the error you reported:

    Traceback (most recent call last):
      File "pik2.py", line 5, in ?
        whatever = pickler.load()
    ImportError: No module named resourcepmdb

> ...
> I've done some research and it looks like cPickle tries to load some
> modules as some kind of test.

That's unlikely ;-)

> From what I can tell the module that cPickle is trying to load is a
> concatenation of the module that it exists in and part of the string
> that was sent in the original request.

Sorry, can't tell from here.

> It looks like something where memory is getting overwritten, but I'm not
> really doing anything too fancy, so i'm not sure how to approach this
> problem.  I'm using python2.2.  Does anyone know how I might go about
> trying to troubleshoot something like this?  Is there a good tool or a
> better method of error handling I can use to try and track this down?
> Or is it better to give up troubleshooting and just upgrade to 2.4?

You could at least _install_ a recent Python so you could use the newer
pickletools module.  pickletools.dis() displays a readable "disassembly" of a
pickle, so that there's no need to guess about what the pickle _thinks_ it's
trying to do.  pickletools won't have any problem working with pickles
created by an older Python.

> ...
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "/home/build/bin/resourceManager/resourceQueue.py", line 50, in
> initQueue
>     pickledList = cPickle.load(queueFPtr)
> ImportError: No module named resourcepmdb'

See above for the most obvious way to get an error like that.

-- http://mail.python.org/mailman/listinfo/python-list
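A small sketch of the pickletools approach on a current Python 3 (where
pickle has absorbed cPickle), showing the opcode whose strings get turned
into an import at load time:

    import pickle
    import pickletools

    def f():
        print("hi")

    # Pickling a module-level function stores only a reference -- the module
    # name and the global's name -- not the function's code.
    data = pickle.dumps([f])

    # The disassembly shows a (STACK_)GLOBAL opcode built from those two
    # strings; if that module can't be imported when the pickle is loaded,
    # load() fails with an ImportError naming it.
    pickletools.dis(data)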
Re: PyObject_SetItem(..) *always* requires a Py_INCREF or not?
[EMAIL PROTECTED]
> I would think every time you add an item to a list you must increase
> reference count of that item.

_Someone_ needs to.  When the function called to add the item does the incref
itself, then it would be wrong for the caller to also incref the item.

> http://docs.python.org/api/refcountDetails.html has an example that
> seems to contradict that
>
> int
> set_all(PyObject *target, PyObject *item)
> {
>     int i, n;
>
>     n = PyObject_Length(target);
>     if (n < 0)
>         return -1;
>     for (i = 0; i < n; i++) {
>         if (PyObject_SetItem(target, i, item) < 0)
>             return -1;
>     }
>     return 0;
> }
>
> *WHY* don't you need a Py_INCREF(item); in the for loop!?!?!?

You should take a break, and read that section again later ;-)

The _point_ of that example is in fact to illustrate that you don't need to
incref when calling PyObject_SetItem().  While I can't know, I'm guessing
that you're still "seeing" PyList_SetItem() here, which has radically
different behavior.  PyList_SetItem() steals a reference to its second
argument, but PyObject_SetItem() does not.

Read the whole section again from its start, and this should be much clearer
the second time through.

-- http://mail.python.org/mailman/listinfo/python-list
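The same ownership rule can be seen from pure Python, as a rough illustration
(CPython-specific; getrefcount's own argument adds a temporary reference, so
only the difference matters): the container performs its own incref when an
item is stored, which is exactly why the caller doesn't have to.

    import sys

    x = object()
    before = sys.getrefcount(x)

    lst = []
    lst.append(x)          # the list increfs x itself when storing it
    after = sys.getrefcount(x)

    print(after - before)  # 1: exactly one new reference, owned by the list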
Re: Numerics, NaNs, IEEE 754 and C99
[Nick Maclaren]
>>>> Firstly, a FAR more common assumption is that integers wrap in twos'
>>>> complement - Python does not do that.

[Grant Edwards]
>>> It used to

[Fredrik Lundh]
>> for integers ?  what version was that ?

[Grant]
> Am I remembering incorrectly?

Mostly but not entirely.

> Didn't the old fixed-width integers operate modulo-wordsize?

Not in Python.

> I distinctly remember having to rewrite a bunch of checksum code when
> fixed-width integers went away.

Best guess is that you're working on a 32-bit box, and remember this Python
<= 2.2 behavior specific to the left shift operator:

    >>> 1 << 31
    -2147483648
    >>> 1 << 32
    0

On a 64-bit box (except Win64, which didn't exist at the time ;-)), those
returned 2**31 and 2**32 instead, while "1 << 64" wrapped to 0 (etc).

Python 2.3 started warning about that:

    >>> 1 << 31
    __main__:1: FutureWarning: x<<y losing bits or changing sign will return a long in Python 2.4 and up
    -2147483648

and in Python 2.4 and up the same shifts return longs instead:

    >>> 1 << 31
    2147483648L
    >>> 1 << 32
    4294967296L

+ - * / on short ints always complained about overflow before int-long
unification, although IIRC very early Pythons missed the overflow error in
(on a 32-bit box) int(-(2L**31))/-1.

-- http://mail.python.org/mailman/listinfo/python-list
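For code that genuinely wants the old 32-bit wraparound (checksums being the
classic case), the portable approach is to mask explicitly rather than depend
on any historical shift behavior -- a small sketch, not something from the
thread:

    MASK32 = 0xFFFFFFFF

    def lshift32(x, n):
        # Emulate a 32-bit left shift: Python ints never wrap on their own.
        return (x << n) & MASK32

    def add32(a, b):
        return (a + b) & MASK32

    print(hex(lshift32(1, 31)))        # 0x80000000
    print(hex(lshift32(1, 32)))        # 0x0 -- the wraparound, made explicit
    print(hex(add32(0xFFFFFFFF, 1)))   # 0x0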
Re: math.pow(x,y)
[Wojciech Muła]
>> You have to use operator **, i.e. 34564323**456356

Or the builtin pow() instead of math.pow().

[Gary Herron]
> That's not very practical.  That computation will produce a value with
> more than 3.4 million digits.

Yes.

> (That is, log10(34564323)*456356 = 3440298.)  Python will attempt this, but
> I was not patient enough to see if it could calculate an answer today (or even
> this week).

On my box it took less than 30 seconds to do

    x = 34564323**456356

If you try to _display_ that as a decimal string, it will take enormously
longer.  Python uses a power-of-2 base internally, and conversion to decimal
takes time quadratic in the number of digits.  Doing

    y = hex(x)

instead is very fast (small fraction of a second).

> I doubt that you really *want* all 3.4 million digits.  So what is it you
> really want?  A scientific or engineering result as a floating point
> number accurate to some reasonable number of digits?  That integer value
> modulo some other integer (as used in various cryptology schemes)?

For example, if you only want the last 6 digits, pow(34564323, 456356, 10**6)
returns 986961 in an eyeblink.

-- http://mail.python.org/mailman/listinfo/python-list
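A sketch separating the cheap questions from the expensive one, for a current
Python (the point is the contrast, not the exact timings):

    import math

    x, y = 34564323, 456356

    # How many decimal digits x**y has, without computing x**y at all.
    print("digits:", math.floor(y * math.log10(x)) + 1)   # ~3.44 million

    # The last six digits via three-argument pow -- effectively instant.
    print("last 6 digits:", pow(x, y, 10 ** 6))

    # Computing the full integer is also reasonably fast...
    big = x ** y
    print("bit length:", big.bit_length())

    # ...it's converting all of it to a decimal string (str(big), print(big))
    # that is expensive; hex(big) stays cheap because the internal base is a
    # power of 2.  (Recent Pythons may also refuse huge int->str conversions
    # unless the limit is raised with sys.set_int_max_str_digits.)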