Re: random writing access to a file in Python
[Claudio Grondi] > I have a 250 Gbyte file (occupies the whole hard drive space) Then where is Python stored ;-)? > and want to change only eight bytes in this file at a given offset of appr. > 200 > Gbyte (all other data in that file should remain unchanged). > > How can I do that in Python? Same way you'd do it in C, really: f = open(PATH_TO_FILE, "r+b") f.seek(appr. 200 Gbyte) f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES) f.close() This depends on your operating system and file system and platform C library supporting seeks to very large offsets. For example, Windows using NTFS does. Try it. Python should complain (raise an exception during the seek() attempt) if your box refuses to cooperate. Use as recent a released version of Python as you can. -- http://mail.python.org/mailman/listinfo/python-list
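[Editorial addendum] To make the sketch above concrete -- the path, offset and replacement bytes below are made up for illustration; the "r+b" mode is the important part:

    offset = 200 * 1024 ** 3            # about 200 gigabytes (hypothetical offset)
    new_bytes = "ABCDEFGH"              # exactly 8 bytes
    f = open("/data/huge.dat", "r+b")   # "r+b": update an existing file in place, binary mode
    f.seek(offset)
    f.write(new_bytes)
    f.close()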
Re: time.clock() going backwards??
[Giovanni Bajo[ >> I experimented something very strange, a few days ago. I was debugging an >> application at a customer's site, and the problem turned out to be that >> time.clock() was going "backwards", that is it was sometimes >> (randomically) returning a floating point value which was "less than" the >> value returned by the previous invokation. The computer was a pretty fast >> one (P4 3Ghz I think, running Windows XP), and this happened only between >> very close invokations of time.clock(). [Terry Reed] > I seem to remember this being mentioned before on the list, perhaps by Tim > Peters. Perhaps he will chime in. No real need ;-) BIOS or HAL bug on a multiprocessor (or maybe even hyperthreaded) box is by far the most likely cause (and the only cause ever identified for sure by people who followed up). Python's C code slinging QueryPerformanceCounter is exceedingly simple, and isn't a suspect because of that. It's on the edge of vague possibility that Microsoft's compiler generates non-monotonic code for converting 64-bit integer to double: i.e., Python's C code obviously relies on that if i and j are _int64 satisfying i >= j, then (double)i >= (double)j. If that type-conversion code /is/ monotonic (which is almost certainly so), then the only ways the C code can fail are if the HW counter overflows (in which case time.clock() would "jump" a huge amount), or if the sequence of values returned by QueryPerformanceCounter is itself non-monotonic at times (which is consistent with known BIOS/HAL bugs). -- http://mail.python.org/mailman/listinfo/python-list
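[Editorial addendum] If you're stuck on a box with such a BIOS/HAL bug, a common workaround (not a fix) is to clamp the clock so successive readings never appear to decrease; a minimal sketch:

    import time

    _last = [0.0]

    def monotonic_clock():
        # clamp time.clock() so successive calls never go backwards
        now = time.clock()
        if now < _last[0]:
            now = _last[0]
        else:
            _last[0] = now
        return now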
Re: Misleading error message when opening a file (on Windows XP SP 2)
[Claudio Grondi] > Here an example of what I mean > (Python 2.4.2, IDLE 1.1.2, Windows XP SP2, NTFS file system, 80 GByte > large file): > > >>> f = file('veryBigFile.dat','r') > >>> f = file('veryBigFile.dat','r+') > > Traceback (most recent call last): >File "", line 1, in -toplevel- > f = file('veryBigFile.dat','r+') > IOError: [Errno 2] No such file or directory: 'veryBigFile.dat' > > Is it a BUG or a FEATURE? Assuming the file exists and isn't read-only, I bet it's a Windows bug, and that if you open in binary mode ("r+b") instead I bet it goes away (this wouldn't be the first large-file text-mode Windows bug). -- http://mail.python.org/mailman/listinfo/python-list
Re: naive misuse? (of PyThreadState_SetAsyncExc)
[EMAIL PROTECTED] >> The documentation for PyThreadState_SetAsyncExc says "To prevent naive >> misuse, you must write your own C extension to call this". Anyone care >> to list a few examples of such naive misuse? [and again] > No? I'll take that then as proof that it's impossible to misuse the > function. That's wise ;-) Stopping a thread asynchronously is in /general/ a dangerous thing to do, and for obvious reasons. For example, perhaps the victim thread is running in a library routine at the time the asynch exception is raised, and getting forcibly ejected from the normal control flow leaves a library-internal mutex locked forever. Or perhaps a catch-all "finally:" clause in the library manages to release the mutex, but leaves the internals in an inconsistent state. Etc. The same kinds of potential disasters accout for why Java deprecated its versions of this gimmick: http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html That said, you can invoke PyThreadState_SetAsyncExc() from Python using the `ctypes` module (which will be included in 2.5, and is available as an extension module for earlier Pythons). -- http://mail.python.org/mailman/listinfo/python-list
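[Editorial addendum] For reference, the ctypes incantation mentioned above looks roughly like this; the helper name and error handling are illustrative only, and every danger described above still applies:

    import ctypes

    def async_raise(tid, exc_type):
        # ask the interpreter to raise exc_type (an exception class) in thread `tid`
        res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
                  ctypes.c_long(tid), ctypes.py_object(exc_type))
        if res == 0:
            raise ValueError("invalid thread id")
        elif res != 1:
            # more than one thread state was affected: undo the damage and complain
            ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid), None)
            raise SystemError("PyThreadState_SetAsyncExc failed")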
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[Joachim Durchholz] >>> Wikipedia says it's going from 2NlogN to N. If a sort is massively >>> dominated by the comparison, that could give a speedup of up to 100% >>> (approximately - dropping the logN factor is almost irrelevant, what >>> counts is losing that factor of 2). [Gabriel Genellina] >> In fact it's the other way - losing a factor of 2 is irrelevant, >> O(2N)=O(N). The logN factor is crucial here. [Joachim Durchholz] > That's just a question of what you're interested in. > > If it's asymptotic behavior, then the O(logN) factor is a difference. > > If it's practical speed, a constant factor of 2 is far more relevant > than any O(logN) factor. Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2 for every N > 100. Base 2 logarithms are actually most appropriate here, and log(N, 2) > 2 for every N > 4. So even if the "2" made sense here (it doesn't -- see next paragraph), the log(N) term dominates it for even relatively tiny values of N. Now it so happens that current versions of Python (and Perl) use merge sorts, where the worst-case number of comparisons is roughly N*log(N, 2), not Wikipedia's 2*N*log(N, 2) (BTW, the Wikipedia article neglected to specify the base of its logarithms, but it's clearly intended to be 2). So that factor of 2 doesn't even exist in current reality: the factor of log(N, 2) is the /whole/ worst-case story here, and, e.g., is near 10 when N is near 1000. A factor of 10 is nothing to sneeze at. OTOH, current versions of Python (and Perl) also happen to use "adaptive" merge sorts, which can do as few as N-1 comparisons in the best case (the input is already sorted, or reverse-sorted). Now I'm not sure why anyone would sort a list already known to be sorted, but if you do, DSU is a waste of time. In any other case, it probably wins, and usually wins big, and solely by saving a factor of (up to) log(N, 2) key computations. > (I'm not on the list, so I won't see responses unless specifically CC'd.) Done. -- http://mail.python.org/mailman/listinfo/python-list
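[Editorial addendum] To make the claimed savings concrete, here's a small sketch that counts key computations both ways; the key function and data are made up, and the exact non-DSU count varies with the input, but the shape of the result (N versus roughly N*log(N, 2)) is the point:

    import random

    ncalls = [0]
    def key(s):                 # deliberately cheap key; we only want to count calls
        ncalls[0] += 1
        return len(s)

    data = [str(random.random()) for i in range(1000)]

    ncalls[0] = 0               # decorate-sort-undecorate: exactly N key computations
    aux = [(key(s), i, s) for i, s in enumerate(data)]
    aux.sort()
    result = [s for k, i, s in aux]
    print ncalls[0]             # 1000

    ncalls[0] = 0               # computing keys inside a comparison function instead
    data.sort(lambda a, b: cmp(key(a), key(b)))
    print ncalls[0]             # roughly 2*N*log(N, 2) -- on the order of 20000 here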
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[Joachim Durchholz] >>>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively >>>>> dominated by the comparison, that could give a speedup of up to 100% >>>>> (approximately - dropping the logN factor is almost irrelevant, what >>>>> counts is losing that factor of 2). [Gabriel Genellina] >>>> In fact it's the other way - losing a factor of 2 is irrelevant, >>>> O(2N)=O(N). The logN factor is crucial here. [Joachim Durchholz] >>> That's just a question of what you're interested in. >>> >>> If it's asymptotic behavior, then the O(logN) factor is a difference. >>> >>> If it's practical speed, a constant factor of 2 is far more relevant >>> than any O(logN) factor. [Tim Peters] >> Nope. Even if you're thinking of base 10 logarithms, log(N, 10) > 2 >> for every N > 100. Base 2 logarithms are actually most appropriate >> here, and log(N, 2) > 2 for every N > 4. So even if the "2" made >> sense here (it doesn't -- see next paragraph), the log(N) term >> dominates it for even relatively tiny values of N. [Joachim Durchholz] > Whether this argument is relevant depends on the constant factors > associated with each term. I'm afraid you're still missing the point in this example: it's not just that Python's (& Perl's) current sorts do O(N*log(N)) worst-case comparisons, it's that they /do/ N*log(N, 2) worst-case comparisons. O() notation isn't being used, and there is no "constant factor" here: the count of worst-case comparisons made is close to exactly N*log(N, 2), not to some mystery-constant times N*log(N, 2). For example, sorting a randomly permuted array with a million distinct elements will require nearly 1000000*log(1000000, 2) ~= 1000000 * 20 = 20 million comparisons, and DSU will save about 19 million key computations in this case. O() arguments are irrelevant to this, and the Wikipedia page you started from wasn't making an O() argument either: http://en.wikipedia.org/wiki/Schwartzian_transform For an efficient ordinary sort function, the number of invocations of the transform function goes from an average of 2nlogn to n; No O() in sight, and no O() was intended there either. You do exactly N key computations when using DSU, while the hypothetical "efficient ordinary sort function" the author had in mind does about 2*N*log(N, 2) key computations when not using DSU. That's overly pessimistic for Python's & Perl's current sort functions, where no more than N*log(N, 2) key computations are done when not using DSU. The /factor/ of key computations saved is thus as large as N*log(N, 2) / N = log(N, 2). O() behavior has nothing to do with this result, and the factor of log(N, 2) is as real as it gets. If key extraction is at all expensive, and N isn't trivially small, saving a factor of log(N, 2) key extractions is /enormously/ helpful. If this is sinking in now, reread the rest of my last reply that got snipped here. > Roughly speaking, if the constant factor on the O(N) term is 100 and the > constant factor on the O(logN) term is 1, then it's still irrelevant. As above, it's talking about O() that's actually irrelevant in this specific case. > My point actually is this: big-Oh measures are fine for comparing > algorithms in general, but when it comes to optimizing concrete > implementations, its value greatly diminishes: you still have to > investigate the constant factors and all the details that the big-Oh > notation abstracts away. That's true, although the argument here wasn't actually abstracting away anything. 
You've been adding abstraction to an argument that didn't have any ;-) > From that point of view, it's irrelevant whether some part of the > algorithm contributes an O(1) or an O(logN) factor: the decision where > to optimize is almost entirely dominated by the constant factors. While true in some cases, it's irrelevant to this specific case. More, in practice a factor of O(log(N)) is almost always more important "for real" than a factor of O(1) anyway -- theoretical algorithms hiding gigantic constant factors in O() notion are very rarely used in real life. For example, the best-known algorithm for finding the number of primes <= x has O(sqrt(x)) time complexity, but AFAIK has /never/ been implemented because the constant factor is believed to be gigantic indeed. Theoretical CompSci is full of results like that, but they have little bearing on practical programming. -- http://mail.python.org/mailman/listinfo/python-list
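[Editorial addendum] If you'd rather measure than take the comparison count on faith, it's easy to instrument; a sketch (the exact count depends on the input ordering, but for random input it stays close to, and below, N*log(N, 2)):

    import random

    class Counted(object):
        count = 0
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):        # sort() only needs "<"
            Counted.count += 1
            return self.value < other.value

    N = 2 ** 16
    data = [Counted(random.random()) for i in range(N)]
    data.sort()
    print N, Counted.count    # the count stays below N*log(N, 2) = 1048576 here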
Re: A Sort Optimization Technique: decorate-sort-dedecorate
[/T] >> OTOH, current versions of Python (and Perl) [/F] > just curious, but all this use of (& Perl) mean that the Perl folks have > implemented timsort ? A remarkable case of independent harmonic convergence: http://mail.python.org/pipermail/python-dev/2002-July/026946.html Come to think of it, I don't actually know whether a /released/ Perl ever contained the development code I saw. Looks like it got added to Perl 5.8: http://perldoc.perl.org/sort.html -- http://mail.python.org/mailman/listinfo/python-list
Re: GC and security
[Aahz] >> Assuming you're talking about CPython, strings don't really participate >> in garbage collection. Keep in mind that the primary mechanism for >> reaping memory is reference counting, and generally as soon as the >> refcount for an object goes to zero, it gets deleted from memory. [Les Schaffer] > ok so far ... Not really ;-) Refcounting is /a/ "garbage collection" technique, and drawing a distinction between refcount-based reclamation and other ways CPython reclaims memory has no bearing on your question. >> Garbage collection only gets used for objects that refer to other >> objects, so it would only apply if string refcounts are being held by >> GC-able objects. > you lost me by here ... That tends to happen when an irrelevant distinction gets made ;-) > is there something different about string objects than other objects in > Python? Not anything relevant wrt garbage collection. It may be relevant that strings are immutable, since that prevents you from overwriting a string's contents yourself. > are you saying EVERY string in Python stays in memory for the lifetime > of the app? He wasn't, and they don't. >> Also keep in mind, of course, that deleting objects has nothing to do >> with whether the memory gets overwritten... > no chance of being overwritten until deleted, no? True. > and once deleted over time there is some probability of being > overwritten, no? True. Also true if you add the intended "non-zero" qualifier to "probability" ;-) > and i am curious how that works. Purely accidental -- nothing guaranteed -- details can (& do) change across releases. Read obmalloc.c for a tour of the accidents du jour. > it sounds like you are saying once a string, always the same string, in > python. > if thats true, i am glad i know that. Not true, so be glad to forget it. A curious possibility: if you do a debug build of Python, obmalloc.c arranges to overwrite all of an object's memory as soon as the object is reclaimed (by any means, refcounting or otherwise). That wasn't for "security" (faux or otherwise), it's to make it easier to detect buggy C code referencing freed memory. -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] > ... > I want to know if there is some way to make Python RE behave like grep > does, Not in general, no. The matching strategies couldn't be more different, and that's both deep and intentional. See Friedl's book for details: http://regex.info/ > or do I have to change to another engine? Yes, if POSIX regexp semantics are what you require. Several years ago there was an extension module for Python supplying POSIX semantics, but I couldn't find anything current just now in a minute of searching. You may be more motivated to search longer ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang[ > ... > In fact, what I'm doing is handle a lot of regular expressions. I > wanted to build VERY LONG regexps part by part and put them all into a > file for easy modification and maintenance. The idea is like this: > > (*INT) = \d+ > (*DECIMAL) = (*INT)\.(*INT) > (*FACTION) = (*DECIMAL)/(*DECIMAL) > (*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT) > ... ... > > What's inside the sytactically wrong (* and ) is something to be > replaced, and then I wrote a little script to do the string > substitution, to get very long regexps to be compiled. I thought that > might be a way to handle long and complex regexps, and in this course I > encountered the problem with the semantics of '|'. > > My approach may sound desperate and funny, but please, do you have any > good idea as to how to handle long and complex regexps? My best advice is to find a different approach entirely. For example, build a parser using parser technology, and use regexps in that /only/ to do gross tokenization ("string of digits", "decimal point", ...): build the rest with a grammar. Regexps are a brittle tool, best tolerated in small doses. For an "NFA" implementation like Python's, you're likely to experience poor speed when combining many complex regexps, and /especially/ when alternatives are ambiguous wrt prefixes (and yours are, else you wouldn't have a problem with "longest match" versus "some shorter match" to begin with. OTOH, under a "DFA" implementation such as POSIX grep's, you're likely to experience exponential memory requirements (DFA implementations can need to build enormous state machines, tracing out in advance all possible paths through all the regexps applied to all possible input strings). Just sounds to me like the wrong tool for the job. -- http://mail.python.org/mailman/listinfo/python-list
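[Editorial addendum] As a rough illustration of that division of labor (the token set below is invented for the example): regexps do only the gross tokenization, and ordinary code -- or a real parser generator -- supplies the grammar:

    import re

    token_re = re.compile(r"""
          (?P<INT>   \d+ )
        | (?P<DOT>   \.  )
        | (?P<SLASH> /   )
        | (?P<SPACE> \s+ )
    """, re.VERBOSE)

    def tokenize(text):
        pos = 0
        while pos < len(text):
            m = token_re.match(text, pos)
            if m is None:
                raise ValueError("bad character at position %d" % pos)
            if m.lastgroup != "SPACE":
                yield m.lastgroup, m.group()
            pos = m.end()

    # a parser layer then decides that INT DOT INT is a DECIMAL, that
    # DECIMAL SLASH DECIMAL is a FRACTION, and so on
    print list(tokenize("3.14/2.71"))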
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] >> Oh, please do have a look at the second link I've posted. There's a >> table comparing the regexp engines. The engines you've tested probably >> all use an NFA implementation. [Bryan Olson] > Unfortunately, the stuff about NFA's is wrong. Friedl's awful > book Strongly disagree: it's an excellent book about the /pragmatics/ of using "regular expressions" as most widely implemented. It's not at all about "regular expressions" in the CompSci sense of the term, which appears to be your complaint. > was the first time I saw this confusion about what NFA is; > I don't know if he originated the mess or just propagated it. As far as I could tell at the time, he originated it. I'm not happy about that either. > "Nondeterministic finite automata" is well defined in theory > of computation. The set of languages accepted by NFA's is > exactly the same as the set accepted by DFA's. And which has very little to do with "regular expressions" as most widely implemented -- gimmicks like backreferences are wholly outside the DFA formalism. > What Python uses is search-and-backtrack. Unfortunately such > engines don't have much theory behind them, and it's hard to > reason generally about what they do. Yup X 3, and the last is precisely why Friedl's book is valuable for people using "NFA" implementations: Friedl does a good job of explaining when and why you're likely to trigger atrocious runtime performance, and gives practical general techniques for avoiding those problems. None of that has anything to do with CompSci regexps either, but is vital knowledge for people hoping to make happy non-trivial use of Python/Perl/etc regexps. -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Licheng Fang] >> Basically, the problem is this: >> >> >>> p = re.compile("do|dolittle") >> >>> p.match("dolittle").group() >> 'do' ... >> The Python regular expression engine doesn't exaust all the >> possibilities, but in my application I hope to get the longest possible >> match, starting from a given point. >> >> Is there a way to do this in Python? [Bryan Olson] > Yes. Here's a way, but it sucks real bad: > > > def longest_match(re_string, text): > regexp = re.compile('(?:' + re_string + ')$') > while text: > m = regexp.match(text) > if m: > return m > text = text[:-1] > return None If you want to try something like that, note that the match() method accepts optional slice arguments, so the "text = text[:-1]" business can be replaced with much quicker little-integer arithmetic. -- http://mail.python.org/mailman/listinfo/python-list
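[Editorial addendum] Spelled out, that suggestion looks something like this -- the same brute-force idea as above, but shrinking an integer endpoint instead of building ever-shorter string copies:

    import re

    def longest_match(re_string, text):
        regexp = re.compile('(?:' + re_string + ')$')
        for end in xrange(len(text), 0, -1):
            m = regexp.match(text, 0, end)   # matches against text[:end] with no copying
            if m:
                return m
        return None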
Re: How to get the "longest possible" match with Python's RE module?
[Bryan Olson] >>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful >>> book [Tim Peters] >> Strongly disagree: [...] [Bryan] > I know I'm disagreeing with a lot of smart people in panning > the book. That's allowed :-) >>> What Python uses is search-and-backtrack. Unfortunately such >>> engines don't have much theory behind them, and it's hard to >>> reason generally about what they do. >> Yup X 3, and the last is precisely why Friedl's book is valuable for >> people using "NFA" implementations: Friedl does a good job of >> explaining when and why you're likely to trigger atrocious runtime >> performance, and gives practical general techniques for avoiding those >> problems. None of that has anything to do with CompSci regexps >> either, but is vital knowledge for people hoping to make happy >> non-trivial use of Python/Perl/etc regexps. > I read the first edition, minus some engine-specific stuff. > General techniques are what I want, and I didn't see them in the > book. He had things to do and not do, but it didn't add up to > building (ir-)regular expressions with reliably tractable > behavior. My guess then is that the book moved at too slow a pace for you, and you got bored. The most valuable general technique he (eventually ;-) explained he called "unrolling", and consists of writing a regexp in the form: normal* (?: special normal* )* where the sets of characters with which `normal` and `special` can start are disjoint. Under most "NFA" implementation, such a regexp is immune to most bad behaviors, whether finding very long matches, or in searching a very long string that happens not to contain any matches. Without knowing this, under many implementations it's very easy to end up writing a regexp that matches quickly when a short match exists, but "blows up" (stack fault, or some checked internal limit hit) if only a very long match exists; and/or takes even exponential time to fail to match when a match doesn't exist. For example, a direct application is writing a robust regular expression to match C string literals (also one form of Python string literal: double-quote character at each end, and backslash escapes, including backslash-newline to span lines): " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* " The "normal case" is that it's just a regular old character: not a double quote, backslash, or newline. The "special case" is a backslash followed by any character whatsoever. The "normal" and "special" cases obviously don't intersect, so the interior of this regexp (i.e., ignoring the double quotes at each end) fits the "unrolling" pattern to a tee. As a result, you can use it with reasonably high confidence. For /why/ that's so, read the rest of his book ;-) In effect, it gives the search engine only one possible course of action at each character in the target string. So long as that's "normal", it has to stay in the "normal" part of the regexp, and as soon as it's "special" it has to leave the "normal" part. The intent is to eliminate any possibility for backtracking, thus ensuring predictable runtime and ensuring that no form of internal backtracking stack needs to be used (let alone hit its limit). The unrolling idea was new to me, and as soon as I learned it I replaced lots of regexps in the Python library with rewritten versions that demonstrably performed very much better -- even allowed closing some bug reports concerning extreme cases (extremely long matches, and unbearable runtime on long strings without any matches). 
A different lesson to take from this is that nobody is likely to stumble into effective "NFA" techniques based on luck or "common sense". It's a set of obscure & specialized learned skills. -- http://mail.python.org/mailman/listinfo/python-list
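[Editorial addendum] For anyone who wants to play with it, here is that same string-literal regexp dropped into Python with re.VERBOSE so the "normal" and "special" pieces stay visible; the test strings are just illustrations:

    import re

    c_string = re.compile(r'''
        "                    # opening quote
        [^"\\\n]*            # normal*: anything except quote, backslash, newline
        (?:
            \\[\x00-\xff]    # special: a backslash plus any one character
            [^"\\\n]*        # normal*
        )*
        "                    # closing quote
    ''', re.VERBOSE)

    print c_string.match(r'"a \"quoted\" word\n"') is not None     # True
    print c_string.match('"no closing quote here...') is not None  # False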
Re: min() & max() vs sorted()
[MRAB] > Some time after reading about Python 2.5 and how the built-in functions > 'min' and 'max' will be getting a new 'key' argument, I wondered how > they would treat those cases where the keys were the same, for example: > > L = ["four", "five"] > print min(L, key = len), max(L, key = len) > > The result is: > > ('four', 'four') min() and max() both work left-to-right, and return the minimal or maximal element at the smallest index. > I would've thought that min(...) should return the same as > sorted(...)[0] (which it does) It does, but only because Python's sort is stable, so that minimal elements retain their original relative order. That implies that the minimal element with smallest original index will end up at index 0 after sorting. > and that max(...) should return the same as sorted(...)[-1] (which it > doesn't). Right -- although I don't know why you'd expect that. > I think that's just down to a subtlety in the way that 'max' is written. It's straightforward, skipping error cases: def max(*args): is_first_arg = True for arg in args: if is_first_arg: winner = arg is_first_arg = False elif arg > winner:# only difference in min() is "<" here instead winner = arg return winner -- http://mail.python.org/mailman/listinfo/python-list
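[Editorial addendum] A quick interactive illustration of the asymmetry MRAB noticed, and of why stability explains the min() half of it (nothing here is specific to len as the key):

    L = ["four", "five"]          # equal keys: len("four") == len("five")
    print min(L, key=len)         # 'four' -- the first minimal element
    print max(L, key=len)         # 'four' -- also the *first* maximal element
    print sorted(L, key=len)[0]   # 'four' -- a stable sort keeps 'four' first
    print sorted(L, key=len)[-1]  # 'five' -- but the last element is 'five'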
Re: how do you convert and array of doubles into floats?
[Marc 'BlackJack' Rintsch] >> What about: >> >> b = array.array('f', a) [Diez B. Roggisch] > AFAIK d and f are synonym for arrays, as python doesn't distinguish > between these two on a type-level. And double it is in the end. While Python has no type of its own corresponding to the native C `float`, the `array` and `struct` modules do understand the native C `float` . A Python float (same as a C `double`) gets converted to a C `float` when stored into one of those, and a C `float` is converted to a Python float (C `double`) when a value is extracted. >>> from array import array >>> x = 1.01 >>> x 1.01 >>> array('d', [x])[0] # full precision is preserved 1.01 >>> array('d', [x])[0] == x True >>> array('f', [x])[0] # trailing bits are lost 1.0 >>> array('f', [x])[0] == x False -- http://mail.python.org/mailman/listinfo/python-list
Re: min() & max() vs sorted()
[MRAB] >>> Some time after reading about Python 2.5 and how the built-in functions >>> 'min' and 'max' will be getting a new 'key' argument, I wondered how >>> they would treat those cases where the keys were the same, for example: >>> >>> L = ["four", "five"] >>> print min(L, key = len), max(L, key = len) >>> >>> The result is: >>> >>> ('four', 'four') [Tim Peters] >> min() and max() both work left-to-right, and return the minimal or >> maximal element at the smallest index. [MRAB] > It doesn't say that in the documentation. Right, the /language/ doesn't define anything about which specific minimal or maximal element is returned. Since you specifically mentioned Python 2.5, I figured you were asking about CPython -- which has always behaved in the way I explained. >>> I would've thought that min(...) should return the same as >>> sorted(...)[0] (which it does) >> It does, but only because Python's sort is stable, so that minimal >> elements retain their original relative order. That implies that the >> minimal element with smallest original index will end up at index 0 >> after sorting. >>> and that max(...) should return the same as sorted(...)[-1] (which it doesn't). >> Right -- although I don't know why you'd expect that. > Strings have index(), find(), etc which work left-to-right and > rindex(), rfind(), etc which work right-to-left. > > Lists have index() but not rindex(). > > I just thought that if min() and max() work left-to-right then for > completeness there should also be rmin() and rmax(); alternatively, > min() should return sorted()[0] and max() should return sorted()[-1] > for symmetry (my personal preference). If you were to make either of those a feature request, I don't expect they'd gain traction -- I expect "who cares?" would be the common challenge, and "I do" wouldn't silence it ;-) Compelling use cases sometimes work to get a new feature, but "completeness" or "symmetry" almost never do on their own (they function more as sanity checks on proposed solutions to use cases). -- http://mail.python.org/mailman/listinfo/python-list
Re: How to get the "longest possible" match with Python's RE module?
[Tim Peters] >> [...] The most valuable general technique [Friedl] (eventually ;-) >> explained he called "unrolling", and consists of writing a regexp in >> the form: >> >>normal* (?: special normal* )* >> >> where the sets of characters with which `normal` and `special` can >> start are disjoint. Under most "NFA" implementation, such a regexp is >> immune to most bad behaviors, whether finding very long matches, or in >> searching a very long string that happens not to contain any matches. [Bryan Olson] > So how general is that? Between 6 and 7 ;-) > The books just says "certain common expressions". I think the real > answer is that one can unroll exactly the true regular expressions. I'm sure he didn't have that much in mind; e.g., it's a real strain to fit a fat alternation of fixed strings into that "pattern" (cat|dog|wombat|weasel|...).' > The unrolling is essentially resolving the states of a DFA by hand. Oh yes, for those who've studied the theoretical underpinnings. Most people have not. Much of what he gives as practical advice amounts to ways to "trick" an "NFA" engine into doing what a "DFA" engine would do, but explained from the POV of how a backtracking search engine works. And it makes /sense/ at that level too, divorced from any knowledge of how a linear-time automaton might be constructed in theory; e.g., you "want to" give the search engine only one choice because you want to avoid the possibility of needing to backtrack later, not because you're picturing a theoretical linear-time automaton and striving to mimic its behavior. That explanation may not work for you, but it's effective for people who haven't studied the theory here. Such explanations also extend naturally to gimmicks like backreferences, which have no counterpart in CompSci regexps. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hardlinks on NTFS
[Wildemar Wildenburger] >> I'm thinking of letting my program create hardlinks (or symlinks). I >> know python allows doing this for ext, reiser and the like, but >> apparently not for ntfs systems. >> Is there any package out there that lets me create links in a platform >> independent way? [Calvin Spealman] > Why isn't NTFS handled the same as any other hardlink supporting > filesystem? Is it a choice or a bug? None of the above. It's not the filesystem, it's what the platform C supports. POSIX specifies a small pile of standard routines for manipulating hard and symbolic links, and Python exposes those /when/ they're available. Microsoft's C supplies none of them. So it's a question of someone on Windows wanting this enough to volunteer to write code, tests, and docs, and volunteer to maintain, a pile of Microsoft-specific code to provide workalike functionality on Windows. -- http://mail.python.org/mailman/listinfo/python-list
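[Editorial addendum] In the meantime, a do-it-yourself fallback is possible; the sketch below uses os.link where the platform C library supplies it, and otherwise calls the Win32 CreateHardLinkW routine via ctypes (illustrative only -- it assumes an NT-family Windows, an NTFS volume, and ctypes being available):

    import os, sys

    def hardlink(source, link_name):
        if hasattr(os, "link"):            # POSIX: the platform C library has link()
            os.link(source, link_name)
        elif sys.platform == "win32":
            import ctypes
            ok = ctypes.windll.kernel32.CreateHardLinkW(
                     unicode(link_name), unicode(source), None)
            if not ok:
                raise OSError("CreateHardLinkW failed for %r" % link_name)
        else:
            raise NotImplementedError("no hard-link support on this platform")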
Re: Installing 2.5 with 2.4?
[Duncan Booth] >> Windows is only smart enough to avoid duplicate entries if you tell it >> to do that. e.g. >> >> PATH c:\python25;c:\python25\scripts;%PATH:c:\python25;c:\python25\scripts;=% >> >> will add the two Python 2.5 folders to the head of the path without >> duplicating them. [John Machin] > Wow .. I didn't know that! What's the syntax? Something like > %variablename[:oldtext=[newtext]]% > ? Yup. > Where is this documented? From a DOS box (cmd.exe), enter set /? -- http://mail.python.org/mailman/listinfo/python-list
Re: Maths error
[Tim Peters] ... >|> Well, just about any technical statement can be misleading if not >|> qualified to such an extent that the only people who can still >|> understand it knew it to begin with <0.8 wink>. The most dubious >|> statement here to my eyes is the intro's "exactness carries over >|> into arithmetic". It takes a world of additional words to explain >|> exactly what it is about the example given (0.1 + 0.1 + 0.1 - 0.3 = >|> 0 exactly in decimal fp, but not in binary fp) that does, and does >|> not, generalize. Roughly, it does generalize to one important >|> real-life use-case: adding and subtracting any number of decimal >|> quantities delivers the exact decimal result, /provided/ that >|> precision is set high enough that no rounding occurs. [Nick Maclaren] > Precisely. There is one other such statement, too: "Decimal numbers > can be represented exactly." What it MEANS is that numbers with a > short representation in decimal can be represented exactly in decimal, > which is tautologous, but many people READ it to say that numbers that > they are interested in can be represented exactly in decimal. Such as > pi, sqrt(2), 1/3 and so on Huh. I don't read it that way. If it said "numbers can be ..." I might, but reading that way seems to require effort to overlook the "decimal" in "decimal numbers can be ...". [attribution lost] >|>> and how is decimal no better than binary? >|> Basically, they both lose info when rounding does occur. For >|> example, > Yes, but there are two ways in which binary is superior. Let's skip > the superior 'smoothness', as being too arcane an issue for this > group, With 28 decimal digits used by default, few apps would care about this anyway. > and deal with the other. In binary, calculating the mid-point > of two numbers (a very common operation) is guaranteed to be within > the range defined by those numbers, or to over/under-flow. > > Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range > (x,y) in decimal, even for the most respectable values of x and y. > This was a MAJOR "gotcha" in the days before binary became standard, > and will clearly return with decimal. I view this as being an instance of "lose info when rounding does occur". For example, >>> import decimal as d >>> s = d.Decimal("." + "9" * d.getcontext().prec) >>> s Decimal("0.9999999999999999999999999999") >>> (s+s)/2 Decimal("1.000000000000000000000000000") >>> s/2 + s/2 Decimal("1.000000000000000000000000000") "The problems" there are due to rounding error: >>> s/2 # "the problem" in s/2+s/2 is that s/2 rounds up to exactly 1/2 Decimal("0.5000000000000000000000000000") >>> s+s # "the problem" in (s+s)/2 is that s+s rounds up to exactly 2 Decimal("2.000000000000000000000000000") It's always something ;-) -- http://mail.python.org/mailman/listinfo/python-list
Re: Maths error
[Tim Peters] ... >> Huh. I don't read it that way. If it said "numbers can be ..." I >> might, but reading that way seems to requires effort to overlook the >> "decimal" in "decimal numbers can be ...". [Nick Maclaren] > I wouldn't expect YOU to read it that way, Of course I meant "putting myself in others' shoes, I don't ...". > but I can assure you from experience that many people do. Sure. Possibly even most. Short of writing a long & gentle tutorial, can that be improved? Alas, most people wouldn't read that either <0.5 wink>. > What it MEANS is "Numbers with a short representation in decimal "short" is a red herring here: Python's Decimal constructor ignores the precision setting, retaining all the digits you give. For example, if you pass a string with a million decimal digits, you'll end up with a very fat Decimal instance -- no info is lost. > can be represented exactly in decimal arithmetic", which is > tautologous. What they READ it to mean is "One advantage of > representing numbers in decimal is that they can be represented > exactly", and they then assume that also applies to pi, sqrt(2), > 1/3 > > The point is that the "decimal" could apply equally well to the > external or internal representation and, if you aren't fairly > clued-up in this area, it is easy to choose the wrong one. Worse, I expect most people have no real idea of that there's a possible difference between internal and external representations. This is often given as a selling point for decimal arithmetic: it's WYSIWYG in ways binary fp can't be (short of inventing power-of-2 fp representations for I/O, which few people would use). [attribution lost] >>>>> and how is decimal no better than binary? [Tim] >>>> Basically, they both lose info when rounding does occur. For >>>> example, [Nick] >>> Yes, but there are two ways in which binary is superior. Let's >>> skip the superior 'smoothness', as being too arcane an issue for >>> this group, >> With 28 decimal digits used by default, few apps would care about >> this anyway. > Were you in the computer arithmetic area during the "base wars" of the > 1960s and 1970s that culminated with binary winning out? Yes, although I came in on the tail end of that and never actually used a non-binary machine. > A lot of very well-respected numerical analysts said that larger bases > led to a faster build-up of error (independent of the precision). My > limited investigations indicated that there was SOME truth in that, > but it wasn't a major matter; I never say the matter settled > definitively. My point was that 28 decimal digits of precision is far greater than supplied even by 64-bit binary floats today (let alone the smaller sizes in most-common use back in the 60s and 70s). "Pollution" of low-order bits is far less of a real concern when there are some number of low- order bits you don't care about at all. >>> and deal with the other. In binary, calculating the mid-point >>> of two numbers (a very common operation) is guaranteed to be within >>> the range defined by those numbers, or to over/under-flow. >>> >>> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range >>> (x,y) in decimal, even for the most respectable values of x and y. >>> This was a MAJOR "gotcha" in the days before binary became standard, >>> and will clearly return with decimal. >> I view this as being an instance of "lose info when rounding does >> occur". For example, > No, absolutely NOT! Of course it is. 
If there were no rounding errors, the computed result would be exactly right -- that's darned near tautological too. You snipped the examples I gave showing exactly where and how rounding error created the problems in (x+y)/2 and x/2+y/2 for some specific values of x and y using decimal arithmetic. If you don't like those examples, supply your own, and if you get a similarly surprising result you'll find rounding error(s) occur(s) in yours too. It so happens that rounding errors in binary fp can't lead to the same counterintuitive /outcome/, essentially because x+x == y+y implies x == y in base 2 fp, which is indeed a bit of magic specific to base 2. The fact that there /do/ exist fp x and y such that x != y yet x+x == y+y in bases > 2 is entirely due to fp rounding error losing info. > This is an orthogonal matter, Disagree. > and is about the loss of an important invariant when using any base > above 2. It is that. > Back in the days when the
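[Editorial addendum] A tiny demonstration of that base-2 invariant failing in decimal (precision cut to 3 digits only to keep the numbers short):

    import decimal
    decimal.getcontext().prec = 3
    x = decimal.Decimal("1.00")
    y = decimal.Decimal("1.001")
    print x == y            # False
    print x + x == y + y    # True -- both sums round to Decimal("2.00")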
Re: Maths error
[Nick Maclaren] >> ... >> Yes, but that wasn't their point. It was that in (say) iterative >> algorithms, the error builds up by a factor of the base at every >> step. If it wasn't for the fact that errors build up, almost all >> programs could ignore numerical analysis and still get reliable >> answers! >> >> Actually, my (limited) investigations indicated that such an error >> build-up was extremely rare - I could achieve it only in VERY >> artificial programs. But I did find that the errors built up faster >> for higher bases, so that a reasonable rule of thumb is that 28 >> digits with a decimal base was comparable to (say) 80 bits with a >> binary base. [Hendrik van Rooyen] > I would have thought that this sort of thing was a natural consequence > of rounding errors - if I round (or worse truncate) a binary, I can be > off by at most one, with an expectation of a half of a least > significant digit, while if I use hex digits, my expectation is around > eight, and for decimal around five... Which, in all cases, is a half ULP at worst (when rounding -- as everyone does now). > So it would seem natural that errors would propagate > faster on big base systems, AOTBE, but this may be > a naive view.. I don't know of any current support for this view. It the bad old days, such things were often confused by architectures that mixed non-binary bases with "creative" rounding rules (like truncation indeed), and it could be hard to know where to "pin the blame". What you will still see stated is variations on Kahan's telegraphic "binary is better than any other radix for error analysis (but not very much)", listed as one of two techincal advantages for binary fp in: http://www.cs.berkeley.edu/~wkahan/MktgMath.pdf It's important to note that he says "error analysis", not "error propagation" -- regardless of base in use, rounding is good to <= 1/2 ULP. A fuller elementary explanation of this can be found in David Goldberg's widely available "What Every Computer Scientist Should Know About Floating-Point", in its "Relative Error and Ulps" section. The short course is that rigorous forward error analysis of fp algorithms is usually framed in terms of relative error: given a computed approximation x' to the mathematically exact result x, what's the largest possible absolute value of the mathematical r = (x'-x)/x (the relative error of x')? This framework gets used because it's more- or-less tractable, starting by assuming inputs are exact (or not, in which case you start by bounding the inputs' relative errors), then successively computing relative errors for each step of the algorithm. Goldberg's paper, and Knuth volume 2, contain many introductory examples of rigorous analysis using this approach. Analysis of relative error generally goes along independent of FP base. It's at the end, when you want to transform a statement about relative error into a statement about error as measured by ULPs (units in the last place), where the base comes in strongly. As Goldberg explains, the larger the fp base the sloppier the relative-error-converted-to-ULPs bound is -- but this is by a constant factor independent of the algorithm being analyzed, hence Kahan's "... better ... but not very much". In more words from Goldberg: Since epsilon [a measure of relative error] can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of B [the FP base, like 2 for binary or 10 for decimal], error estimates of formulas will be tighter on machines with a small B. 
When only the order of magnitude of rounding error is of interest, ulps and epsilon may be used interchangeably, since they differ by at most a factor of B. So that factor of B is irrelevant to most apps most of the time. For a combination of an fp algorithm + set of inputs near the edge of giving gibberish results, of course it can be important. Someone using Python's decimal implementation has an often very effective workaround then, short of writing a more robust fp algorithm: just boost the precision. -- http://mail.python.org/mailman/listinfo/python-list
Re: doctest problem with null byte
[Stuart D. Gathman] > I am trying to create a doctest test case for the following: > > def quote_value(s): > """Quote the value for a key-value pair in Received-SPF header > field if needed. No quoting needed for a dot-atom value. > > >>> quote_value(r'abc\def') > '"abcdef"' > >>> quote_value('abc..def') > '"abc..def"' > >>> quote_value('') > '""' > >>> quote_value('-all\x00') > '"-allx00"' > ... > > However, doctest does *not* like the null byte in the example (yes, > this happens with real life input): > ** > File "/home/stuart/pyspf/spf.py", line 1453, in spf.quote_value > Failed example: > quote_value('-all') > Exception raised: > Traceback (most recent call last): > File > "/var/tmp/python2.4-2.4.4c1-root/usr/lib/python2.4/doctest.py", > line 1248, in __run > compileflags, 1) in test.globs > TypeError: compile() expected string without null bytes > ** > > How can I construct a test cast for this? As the docs say, doctest examples are parsed /twice/: once when Python compiles the module and creates string objects, and again when doctest extracts and passes example substrings to the compile() function (which you can see in your traceback). The first time, the \x00 in the quote_value('-all\x00') portion of your docstring got changed into an honest-to-goodness NUL byte. Passing that to compile() can't work. To alleviate that, and the "leaning toothpick syndrome" in your other examples, a simple approach is to make your docstring a raw string (stick the letter 'r' in front of the opening triple-quote). For example, this works fine: def f(): r""" >>> len("\\") 1 >>> ord("\x00") 0 """ Because it's a raw string, Python does /not/ change the \x00 portion into a NUL byte when it compile the module. Instead the docstring continues to contain the 4-character substring: \x00 and that's what compile() needs to see. To perform the same test without using a raw string requires slamming in more backslashes (or other ad-hoc escaping tricks): def g(): """ >>> len("") 1 >>> ord("\\x00") 0 """ -- http://mail.python.org/mailman/listinfo/python-list
Re: Signed zeros: is this a bug?
[attribution lost] ... >>> Yup: the workaround seems to be as simple as replacing all occurrences >>> of -0.0 with -(0.0). I'm embarrassed that I didn't figure this out >>> sooner. >>> >>> >>> x, y = -(0.0), 0.0 >>> >>> x, y >>> (-0.0, 0.0) [Alex Martelli] >> Glad it works for you, but it's the kind of workaround that could >> break with any minor tweak/optimization to the compiler... >> very fragile:-(. [also Alex] > OTOH, Python 2.4 works just fine...: > > Python 2.4.3 (#1, Apr 7 2006, 10:54:33) > [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin > Type "help", "copyright", "credits" or "license" for more information. > >>> 0.0,-0.0 > (0.0, -0.0) > >>> -0.0,0.0 > (-0.0, 0.0) > > so it seems to be very specifically a 2.5 problem. It's a bug that keeps resurfacing, probably because there's no portable way to test that it stays fixed :-( (it's not an accident that the OP relied on atan2 to distinguish +0.0 from -0.0! they act the same in /almost/ all contexts). Python's grammar doesn't have negative numeric literals. Instead "-" CONSTANT_LITERAL looks like unary minus applied to the non-negative CONSTANT_LITERAL. All early versions of Python worked that way too. The first time the +/- 0.0 bug occurred is when the front end was changed to act as if "-" CONSTANT_LITERAL were a literal in its own right, and then that +0.0 == -0.0 caused the first instance of either in a compilation unit to be used as the value for all instances of both throughout the compilation unit. That was fixed by refusing to apply the optimimization if the value of CONSTANT_LITERAL was a float that compared equal to 0.0. 2.5 introduced a new front end and more ambitious constant-folding, and I expect the bug showed up again due to one of those. Note: printing the value of a float 0 may not reveal its sign. IIRC, glibc's float-to-string routines do display the sign of 0, but Microsoft's do not. However, atan2() is sensitive to the sign under both glibm and Microsoft's libm equivalent. -- http://mail.python.org/mailman/listinfo/python-list
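[Editorial addendum] For completeness, here's the atan2() trick alluded to above for telling the two zeros apart (math.copysign() didn't exist yet in the Pythons under discussion):

    from math import atan2

    def is_negative_zero(x):
        # meaningful only when x == 0.0; atan2() honors the sign of a zero
        return x == 0.0 and atan2(x, -1.0) < 0.0

    print is_negative_zero(0.0)      # False
    print is_negative_zero(-(0.0))   # True
    print is_negative_zero(-0.0)     # True, unless the constant-folding bug discussed above bites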
Re: math.pow(x,y)
[Wojciech Muła] >> You have to use operator **, i.e. 34564323**456356 Or the builtin pow() instead of math.pow(). [Gary Herron] > That's not very practical. That computation will produce a value with > more than 3.4 million digits. Yes. > (That is, log10(34564323)*456356 = 3440298.) Python will attempt this, but > I was not patient enough to see if it could calculate an answer today (or even > this week). On my box it took less than 30 seconds to do x = 34564323**456356 If you try to _display_ that as a decimal string, it will take enormously longer. Python uses a power-of-2 base internally, and conversion to decimal takes time quadratic in the number of digits. Doing y = hex(x) instead is very fast (small fraction of a second). > I doubt that you really *want* all 3.4 million digits. So what is it you > really want? A scientific or engineering result as a floating point > number accurate to some reasonable number of digits? That integer value > modulo some other integer (as used in various cryptology schemes)? For example, if you only want the last 6 digits, pow(34564323, 456356, 10**6) returns 986961 in an eyeblink. -- http://mail.python.org/mailman/listinfo/python-list
Re: Numerics, NaNs, IEEE 754 and C99
[Nick Maclaren] Firstly, a FAR more common assumption is that integers wrap in twos' complement - Python does not do that. [Grant Edwards] >>> It used to [Fredrik Lundh] >> for integers ? what version was that ? [Grant] > Am I remebering incorrectly? Mostly but not entirely. > Didn't the old fixed-width integers operate modulo-wordsize? Not in Python. > I distinctly remember having to rewrite a bunch of checksum code when > fixed-width integers went away. Best guess is that you're working on a 32-bit box, and remember this Python <= 2.2 behavior specific to the left shift operator: >>> 1 << 31 -2147483648 >>> 1 << 32 0 On a 64-bit box (except Win64, which didn't exist at the time ;-)), those returned 2**31 and 2**32 instead, while "1 << 64" wrapped to 0 (etc). Python 2.3 started warning about that: >>> 1 << 31 __main__:1: FutureWarning: x<<y losing bits or changing sign will return a long in Python 2.4 and up -2147483648 and Python 2.4 went on to return unbounded longs instead: >>> 1 << 31 2147483648L >>> 1 << 32 4294967296L + - * / on short ints always complained about overflow before int-long unification, although IIRC very early Pythons missed the overflow error in (on a 32-bit box) int(-(2L**31))/-1. -- http://mail.python.org/mailman/listinfo/python-list
Re: PyObject_SetItem(..) *always* requires a Py_INCREF or not?
[EMAIL PROTECTED] > I would think everytime you add an item to a list you must increase > reference count of that item. _Someone_ needs to. When the function called to add the item does the incref itself, then it would be wrong for the caller to also incref the item. > http://docs.python.org/api/refcountDetails.html has an example that > seems to contradict that > > int > set_all(PyObject *target, PyObject *item) > { > int i, n; > > n = PyObject_Length(target); > if (n < 0) > return -1; > for (i = 0; i < n; i++) { > if (PyObject_SetItem(target, i, item) < 0) > return -1; > } > return 0; > } > > *WHY* don't you need a Py_INCREF(item); in the for loop!?!?!? You should take a break, and read that section again later ;-) The _point_ of that example is in fact to illustrate that you don't need to incref when calling PyObject_SetItem(). While I can't know, I'm guessing that you're still "seeing" PyList_SetItem() here, which has radically different behavior. PyList_SetItem() steals a reference to its second argument, but PyObject_SetItem() does not. Read the whole section again from its start, and this should be much clearer the second time through. -- http://mail.python.org/mailman/listinfo/python-list
Re: need helping tracking down weird bug in cPickle
[Carl J. Van Arsdall] > Hey everyone, cPickle is raising an ImportError that I just don't quite > understand. When that happens, the overwhelmingly most likely cause is that the set of modules on your PYTHONPATH has changed since the pickle was first created, in ways such that a module _referenced_ inside the pickle can no longer be found. For example, here's file resourcepmdb.py on my box: def f(): print "hi" Here's a bit of Python code that uses it: import cPickle import resourcepmdb pf = open("junk.pik", "wb") pickler = cPickle.Pickler(pf) pickler.dump([resourcepmdb.f]) pf.close() Pickle stores module globals like user-defined classes and module functions (the case above) not by saving the code they contain, but _just_ by saving strings recording the module's name and the global's name. The unpickling process creates an import out of those strings. So long as resourcepmdb stays on my PYTHONPATH, unpickling junk.pik works fine. For example, this code: import cPickle pf = open("junk.pik", "rb") pickler = cPickle.Unpickler(pf) whatever = pickler.load() pf.close() print whatever prints something like [<function f at 0x...>] But if resourcepmdb gets off my path (for example, resourcepmdb.py gets deleted, renamed, or moved -- or PYTHONPATH itself changes), that same code dies with exactly the error you reported: Traceback (most recent call last): File "pik2.py", line 5, in ? whatever = pickler.load() ImportError: No module named resourcepmdb > ... > I've done some research and it looks like cPickle tries to load some > modules as some kind of test. That's unlikely ;-) > From what I can tell the module that cPickle is trying to load is a > concatenation of the module that it exists in and part of the string > that was sent in the orignal request. Sorry, can't tell from here. > It looks like something where memory is getting overwritten, but I'm not > really doing anything too fancy, so i'm not sure how to approach this > problem. I'm using python2.2. Does anyone know how I might go about > trying to troubleshoot something like this? Is there a good tool or a > better method of error handling I can use to try and track this down? > Or is it better to give up troubleshooting and just upgrade to 2.4? You could at least _install_ a recent Python so you could use the newer pickletools module. pickletools.dis() displays a readable "disassembly" of a pickle, so that there's no need to guess about what the pickle _thinks_ it's trying to do. pickletools won't have any problem working with pickles created by an older Python. > ... > Here's the traceback: > > Traceback (most recent call last): > File "/home/build/bin/resourceManager/resourceQueue.py", line 50, in > initQueue > pickledList = cPickle.load(queueFPtr) > ImportError: No module named resourcepmdb' See above for the most obvious way to get an error like that. -- http://mail.python.org/mailman/listinfo/python-list
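[Editorial addendum] For reference, here's the kind of thing pickletools shows you, using a stand-in function since resourcepmdb isn't available here; the point is that the disassembly makes the embedded module-name/global-name strings easy to spot:

    import cPickle, pickletools

    def f():
        print "hi"

    s = cPickle.dumps([f])      # stand-in for the resourcepmdb.f case above
    pickletools.dis(s)
    # the output contains a GLOBAL opcode naming '__main__ f' -- in the
    # original setup it would say 'resourcepmdb f', and that's the string
    # the unpickler later turns into an import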
Re: Iteration over recursion?
[MTD] > I've been messing about for fun creating a trial division factorizing > function and I'm naturally interested in optimising it as much as > possible. > > I've been told that iteration in python is generally more > time-efficient than recursion. Is that true? Since you heard it from me to begin with, I suppose saying "yes" again won't really help ;-) You can use time.time() or time.clock(), or the `timeit` module, to measure speed. Try timing a dead-simple factorial functions both ways. BTW, I've worked on Python's implementation for close to 15 years now, so when I say something about Python, it's _usually_ safe to just believe it :-) Questioning is fine, but in cases like this where it's easy to find out for yourself, I probably won't make time to respond. > Here is my algorithm as it stands. Any suggestions appreciated! > > from math import * > > def factorize(x): > """ Return factors of x """ > factors = [] > lim = int(sqrt(x)) Here you're limited to integers for which a floating sqrt is accurate. That's 53 bits on most platforms. For integers larger than that, the code may produce incorrect results in mysterious ways at times (because the square root may be inaccurate). > y = divmod(x,2) Most people would use "unpacking assignment" here instead, like q, r = divmod(x, 2) Then later code gets to use meaningful names instead of messing with mysterious little integer subscripts. It's also quicker to use local names than to build subscripting expressions. > if y[1] == 0: # x is even > factors = factors + [2] + factorize(y[0]) Since `factors` upon entry to this block must be [], it's kinda contorted. More obvious (if you _have_ to use recursion) would be: return [2] + factorize(y[0]) and then unindent the rest of the code a level. > else: # x is odd > i = 3 > while i <= lim: > y = divmod(x,i) Again more commonly written: q, r = divmod(x, i) > if y[1] == 0: > factors = factors + [i] + factorize(y[0]) It hardly matters in this context, but appending to a list is _generally_ much faster than building a brand new list from scratch. At a higher level, your recursions always start from 2 again; e.g., if i happens to be 434349 here, the factorize(y[0]) call starts over from 2, despite that you already know that no divisor less than 434349 can succeed. To do this _efficiently_ with recursion requires passing more knowledge across recursion levels. > i = lim+1 > else: > i += 2 > > if factors == []: # necessary step for correct recursion > factors = [x] > > return factors Recursion is pretty bizarre here. Not _wrong_, just "pretty bizarre", and is leading you into higher-level inefficiences. There are all sorts of ways to repair that, but I'm not going to talk about them because it's easy to write an iterative algorithm instead. Here's one way to do that, avoiding float sqrt (this works correctly for integers of any size, although trial division is hopelessly _inefficient_ for finding "large" factors), and skipping multiples of 3 in the inner loop too: def trial(n): if n <= 1: return [n] factors = [] for tiny in 2, 3: while n % tiny == 0: factors.append(tiny) n //= tiny delta = 2 d = 5 while 1: q, r = divmod(n, d) if r: if q < d: break d += delta delta = 6 - delta else: factors.append(d) n = q if n > 1: factors.append(n) return factors If we call the original input N, the key invariants holding at the top of the loop are: 1. N >= 2, and N == n * the product of the integers in `factors` 2. n is not divisible by any prime < d 3. 
all elements in `factors` are prime It may take a bit of thought to see that d marches over all integers >= 5 that aren't divisible by either 2 or 3: 5, 7, 11, 13, 17, 19, 23, 25, 29, 31, ... It's more-or-less generally true that establishing and proving loop invariants in an iterative algorithm is "the same thing" as establishing and proving pre- and post-conditions in a recursive algorithm. While it may feel strained at first, it's very much worthwhile learning how to do that. For example, from #2 it's easy to prove that if q < d, n must either be 1 or a prime. That's what justifies the "break" statement inside the loop. Then since that's the only way to get out of the loop, n is either 1 or a prime when the loop finishes, and that explains the rest of the code. -- http://mail.python.org/mailman/listinfo/python-list
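[Editorial addendum] A few doctest-style sanity checks of trial(); multiplying the factors back together is an easy way to confirm the invariants above:

    >>> trial(360)
    [2, 2, 2, 3, 3, 5]
    >>> trial(97)      # a prime comes back as itself
    [97]
    >>> trial(2)
    [2]
    >>> trial(1)
    [1]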
Re: Python SHA-1 as a method for unique file identification ? [help!]
[EP <[EMAIL PROTECTED]>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script

It's your script.  Always open binary files in binary mode.  It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind of
sense in the mists of history, and has persisted in worship of Backward
Compatibility despite that the original reason for it went away _long_
ago).

...

> I am trying to reduce duplicate files in storage at home - I have a
> large number files (e.g. MP3s) which have been stored on disk multiple
> times under different names or on different paths.

...

> All seemed to be working until I examined my log files and found files
> with the same SHA digest had different sizes according to
> os.stat(fpath).st_size .  This is on Windows XP.
>
> - Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

> - Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted
;-)

...

> def hashit(pth):
>     """Takes a file path and returns a SHA hash of its string"""
>     fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.

Do note that there are faster ways to go about this.  For example, if a
particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).
Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc.  If duplicates are uncommon, this
can be a huge savings.  For example, here's output from an old
dup-finding Python program of mine run over a directory tree which
happens to contain no duplicates:

    Files
        Total               10,718
        Unique              10,718
        Duplicate                0
        # w/ unique size    10,053
        # w/ unique prefix     665

    Bytes
        Total        1,401,668,015
        Unique       1,401,668,015
        Duplicate                0
        Read                76,688
        Excess              76,688

The last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.

No hashes are computed by that program.  I'd use a hash if instead I
were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?" could
be done by computing its hash.  While the program above generally reads
"amazingly" few file bytes, it hammers the OS file directory services,
and it would go faster to compute a hash of a new file than to run the
whole analysis again.
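For what it's worth, here is roughly what the repaired hashit() and the
size pre-filter could look like (a sketch of mine, not from the thread;
the chunk size and helper names are arbitrary):

    import os
    import sha   # hashlib.sha1 in Python >= 2.5 exposes the same interface

    def hashit(pth, chunksize=1 << 20):
        """Return the SHA-1 hex digest of a file's bytes."""
        hasher = sha.new()
        f = open(pth, 'rb')          # 'rb', not 'r' -- crucial on Windows
        try:
            while True:
                chunk = f.read(chunksize)
                if not chunk:
                    break
                hasher.update(chunk)
        finally:
            f.close()
        return hasher.hexdigest()

    def group_by_size(paths):
        """Map st_size -> list of paths; only multi-path groups can hold dups."""
        groups = {}
        for p in paths:
            groups.setdefault(os.stat(p).st_size, []).append(p)
        return groups

Only files that share a size need to be opened or hashed at all.

-- http://mail.python.org/mailman/listinfo/python-list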
Re: Iteration over recursion?
[Kay Schluehr]
> You might use a separate prime generator to produce prime factors.  The
> factorize algorithm becomes quite simple and configurable by prime
> generators.

Alas, yours was _so_ simple that it always takes time proportional to
the largest prime factor of n (which may be n) instead of worst-case
time proportional to sqrt(n).  Repair that, and it's not quite so
simple anymore.

The speed of the sieve generator would also benefit a lot by never
looking at even integers, beyond an initial "yield 2" ... the OP said
he was looking for speed more than elegance, and they're not good
buddies here ;-)

> def eratosthenes():
>     memo = {}
>     q = 2
>     while True:
>         p = memo.pop(q, None)
>         if p is None:
>             yield q
>             memo[q*q] = q
>         else:
>             x = p + q
>             while x in memo:
>                 x += p
>             memo[x] = p
>         q+=1
>
> def factorize(n, sieve = eratosthenes):
>     if n <= 1:
>         return [n]
>     factors = []
>     primes = sieve()
>     for q in primes:
>         while n % q == 0:
>             factors.append(q)
>             n //= q
>         if n == 1:
>             return factors

At _some_ point you might think that's bound to be faster than only
skipping multiples of 2 and 3, because it does fewer divisions.  But to
get to a particular prime p, the sieve there has to go around its outer
loop about 3x as often as the trial() function goes around its inner
loop, and grows a dict with about p/ln(p) entries (while the trial()
function's memory use is constant), and those aren't free.

Applied to 991**2 (constructed so that both functions take time
proportional to sqrt(n)), on my box the factorize() function took 4x
longer than trial() (approximately 16 seconds versus 4).

Trial division isn't practical for such large inputs, but I picked the
square of a "large" prime to give factorize() maximum advantage in
skipping division attempts (by the time we get to a prime p,
factorize() tries about p/ln(p) divisions, while trial() does about
p/3; so trial() does about ln(p)/3 times as many divisions as
factorize(): the larger the p we need, the larger that _factor_ gets).
Alas, it appears that made the dict so big that cache faults on dict
accesses more than wiped out the division advantage.

At the smaller 83**2, factorize() took only about 3.6x longer, despite
losing some of its relative division advantage.

In short, it's never what you think it is ;-)
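Just to make the "never look at even integers" remark concrete, an
odds-only variant of that generator might look like the following (my
sketch, untimed, not from the thread -- whether it actually wins still
runs into the dict-growth effects described above):

    def eratosthenes_odds():
        """Incremental sieve: yield 2 once, then consider only odd candidates."""
        yield 2
        memo = {}
        q = 3
        while True:
            p = memo.pop(q, None)
            if p is None:
                yield q
                memo[q*q] = q      # q*q is odd, and is q's first composite multiple
            else:
                x = q + 2*p        # step by 2*p so we stay on odd multiples of p
                while x in memo:
                    x += 2*p
                memo[x] = p
            q += 2

-- http://mail.python.org/mailman/listinfo/python-list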
Re: random.jumpahead: How to jump ahead exactly N steps?
[Matthew Wilson]
> The random.jumpahead documentation says this:
>
> Changed in version 2.3: Instead of jumping to a specific state, n steps
> ahead, jumpahead(n) jumps to another state likely to be separated by
> many steps..
>
> I really want a way to get to the Nth value in a random series started
> with a particular seed.  Is there any way to quickly do what jumpahead
> apparently used to do?

No known way, and it seems unlikely that any quick way will be found in
my lifetime ;-) for the Mersenne Twister.

In Pythons <= 2.2, the underlying generator was the algebraically
_very_ much simpler original Wichmann-Hill generator, and jumping ahead
by exactly n steps was just a matter of a few modular integer
exponentiations.  That took time proportional to log(n), so was
extremely fast.

It was also much more _necessary_ using W-H, since W-H's period was
only around 10**13, while the Twister's period is around 10**6001:
even if you're going to use a billion random numbers per sequence, the
Twister's period has a way-way-beyond astronomical number of
non-overlapping subsequences of that length.  The chance of hitting an
overlap is correspondingly minuscule.

> I devised this function, but I suspect it runs really slowly:

Well, it takes exactly as much time as it takes to call random() n
times.

> def trudgeforward(n):
>     '''Advance the random generator's state by n calls.'''
>     for _ in xrange(n): random.random()
>
> So any speed tips would be very appreciated.

What are you trying to achieve in the end?  Take it as a fact that
there is no known way to advance the Twister by n states faster than
what you've already found.
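One practical workaround, depending on what you're after (my sketch,
not part of the original reply): pay the O(n) cost once, then
checkpoint the generator's state with getstate()/setstate() so you can
return to "n steps past this seed" instantly:

    import random

    random.seed(42)                    # any fixed seed
    for _ in xrange(10**6):            # trudge forward exactly once
        random.random()
    checkpoint = random.getstate()     # snapshot: "seed 42, 1,000,000 steps in"

    # Later (or in a separate generator), jump straight back to that point.
    gen = random.Random()
    gen.setstate(checkpoint)
    print gen.random()

-- http://mail.python.org/mailman/listinfo/python-list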
Re: Iteration over recursion?
[MTD <[EMAIL PROTECTED]>]
> I've been testing my recursive function against your iterative
> function, and yours is generally a quite steady 50% faster on
> factorizing 2**n +/- 1 for 0 < n < 60.

If you're still not skipping multiples of 3, that should account for
most of it.

> I think that, for a challenge, I'll try to make a recursive function
> that matches or beats the iterative function -- it's worth the
> experiment!

Don't stop on that one before you succeed ;-)  Provided you make no
more than one recursive call per factor, you can't make more than
log(n, 2) recursive calls in total, and _typically_ far fewer than
that.  IOW, the number of recursive calls should generally be trivial,
and factoring 2**n will be the worst case.

-- http://mail.python.org/mailman/listinfo/python-list
Re: Is Queue.Queue.queue.clear() thread-safe?
[Russell Warren]
> I'm guessing no, since it skips down through any Lock semantics,

Good guess :-)  It's also unsafe because some internal conditions must
be notified whenever the queue becomes empty (else you risk deadlock).

> but I'm wondering what the best way to clear a Queue is then.
>
> Esentially I want to do a "get all" and ignore what pops out, but I
> don't want to loop through a .get until empty because that could
> potentially end up racing another thread that is more-or-less blindly
> filling it asynchronously.
>
> Worst case I think I can just borrow the locking logic from Queue.get
> and clear the deque inside this logic, but would prefer to not have to
> write wrapper code that uses mechanisms inside the object that might
> change in the future.

There's simply no defined way to do this now.

> Also - I can of course come up with some surrounding architecture to
> avoid this concern altogether, but a thread-safe Queue clear would do
> the trick and be a nice and short path to solution.
>
> If QueueInstance.queue.clear() isn't thread safe... what would be the
> best way to do it?  Also, if not, why is the queue deque not called
> _queue to warn us away from it?

"Consenting adults" -- if you want an operation that isn't supplied out
of the box, and are willing to take the risk of breaking into the
internals, Python isn't going to stop you from doing whatever you like.
"mutex" isn't named "_mutex" for the same reason.  A

    q.mutex.acquire()
    try:
        q.queue.clear()
        q.unfinished_tasks = 0
        q.not_full.notify()
        q.all_tasks_done.notifyAll()
    finally:
        q.mutex.release()

sequence probably works (caveat emptor).  BTW, it may be _possible_ you
could use the newer task_done()/join() Queue gimmicks in some way to
get what you're after (I'm not really clear on that, exactly).

Anyway, yes, there is cross-release risk in breaking into the internals
like that.  That's an "adult decision" for you to make.  The only way
to get permanent relief is to re-think your approach, or propose adding
a new Queue.clear() method and Queue._clear() default implementation.
I don't know whether that would be accepted -- it seems simple enough,
although since you're the first person to ask for it in 15 years :-),
you won't get much traction arguing that's a critical lack, and every
new bit of cruft becomes an ongoing burden too.
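If you'd rather keep the internals-poking in one place, the same
sequence can live in a small subclass (a sketch of mine, with my own
names, and assuming the Python 2.5 Queue attributes discussed above):

    import Queue

    class ClearableQueue(Queue.Queue):
        """Queue with a thread-safe clear(); a sketch, not a stdlib class."""

        def clear(self):
            self.mutex.acquire()
            try:
                self._clear()
                self.unfinished_tasks = 0
                self.not_full.notify()
                self.all_tasks_done.notifyAll()
            finally:
                self.mutex.release()

        def _clear(self):
            # Default implementation for the deque-based Queue.
            self.queue.clear()

Code that needs clearing then creates a ClearableQueue() up front and
calls q.clear(), instead of reaching into q.queue at each call site.

-- http://mail.python.org/mailman/listinfo/python-list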
Re: Is Queue.Queue.queue.clear() thread-safe?
[Russell Warren]
>>> I'm guessing no, since it skips down through any Lock semantics,

[Tim Peters]
>> Good guess :-)  It's also unsafe because some internal conditions must
>> be notified whenever the queue becomes empty (else you risk deadlock).

[Fredrik Lundh]
> "also" ?  if it weren't for the other things, the clear() call itself
> would have been atomic enough, right ?

Define "other things" :-)  For example, Queue.get() relies on that when
self._empty() is false, it can do self._get() successfully.  That's
100% reliable now because the self.not_empty condition is in the
acquired state across both operations.  If some yahoo ignores the
underlying mutex and clears the queue after self._empty() returns false
but before get() executes self._get(), the call to the latter will
raise an exception.  That kind of failure is what I considered to be an
instance of a problem due to the OP's "skips down through any Lock
semantics"; the failure to notify internal conditions that the queue is
empty is a different _kind_ of problem, hence my "also" above.

If you're just asking whether deque.clear() is "atomic enough" on its
own, define "enough" :-)  It has to decref each item in the deque, and
that can end up executing arbitrary Python code (due to __del__ methods
or weakref callbacks), and that in turn can allow the GIL to be
released, allowing all other threads to run, and any of that
Python-level code may even mutate the deque _while_ it's being cleared.

The C code in deque_clear() looks threadsafe in the sense that it won't
blow up regardless -- and that's about the strongest that can ever be
said for a clear() method.  dict.clear() is actually a bit stronger,
acting as if a snapshot were taken of the dict's state at the instant
dict.clear() is called.  Any mutations made to the dict as a result of
decref side effects while the original state is getting cleared
survive.  In contrast, if a new item is added to a deque d (via decref
side effect) while d.clear() is executing, it won't survive.  That's
"atomic enough" for me, but it is kinda fuzzy.

-- http://mail.python.org/mailman/listinfo/python-list
Re: Questions on Threading
[j.c.sackett]
> I'm using the threading module to accomplish some distributed
> processing on a project, and have a basic (I hope) question that I
> can't find an answer to elsewhere.
>
> I've noted that there's a lot of documentation saying that there is no
> external way to stop a thread,

True.

> and yet when creating a thread through the threading module I can see
> that it has a method called _Thread__stop, which does indeed stop it.

You're misreading the code then.  __stop() is used internally by a
thread T to record when T itself knows it's _about_ to stop.  Other
threads may be blocked in T.join() calls, and the purpose of T calling
T.__stop() is to notify such threads they can proceed.

> Is this in any way dangerous ( i.e. doesn't actually stop the thread,

That's right.

> or causes other thread timing issues down the road?)

If you call T.__stop(), T will keep running (exactly as if you hadn't
called it), but threads waiting on T.join() will think T has stopped.

> Or is there some other reason that there is officially no stop method?

Python never offered it primarily for the same reasons Java regretted
offering it (so much so that Java eventually deprecated it):

    http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html
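The usual substitute is cooperative shutdown: the thread itself checks
a flag at safe points and exits on its own.  A minimal sketch (mine,
with made-up names, not part of the original reply):

    import threading
    import time

    class Worker(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
            self._stop_requested = threading.Event()

        def run(self):
            # Check the flag at points where it's safe to quit.
            while not self._stop_requested.isSet():
                time.sleep(0.1)   # placeholder for one bounded chunk of work

        def stop(self):
            self._stop_requested.set()

    w = Worker()
    w.start()
    time.sleep(1)
    w.stop()    # politely ask; run() notices at its next check
    w.join()

-- http://mail.python.org/mailman/listinfo/python-list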
Re: how to stop python...
[bruce]
> perl has the concept of "die".  does python have anything similar.  how
> can a python app be stopped?
>
> the docs refer to a sys.stop.

Python docs?  Doubt it ;-)

> but i can't find anything else... am i missing something...

>>> import sys
>>> print sys.exit.__doc__
exit([status])

Exit the interpreter by raising SystemExit(status).
If the status is omitted or None, it defaults to zero (i.e., success).
If the status is numeric, it will be used as the system exit status.
If it is another kind of object, it will be printed and the system
exit status will be one (i.e., failure).

Of course there's nothing to stop you from catching SystemExit either --
it's just another exception, but one that happens to shut down the
interpreter in the specified way if it isn't caught & handled.
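For instance, a few lines show that last point (my example, not from
the post):

    import sys

    def shutdown():
        sys.exit(2)            # raises SystemExit(2)

    try:
        shutdown()
    except SystemExit, e:
        print "caught SystemExit with status", e.code    # prints 2

    # Execution carries on here; the interpreter only exits if the
    # exception reaches the top level uncaught.

-- http://mail.python.org/mailman/listinfo/python-list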
Re: Valgrind memory-checker reports memory problems in Python
[Nathan Bates] > Are the Python developers running Python under Valgrind? Please read Misc/README.valgrind (in your Python distribution). -- http://mail.python.org/mailman/listinfo/python-list
Re: hash() yields different results for different platforms
[Grant Edwards]
>> ...
>> The low 32 bits match, so perhaps you should just use that
>> portion of the returned hash?
>>
>> >>> hex(12416037344)
>> '0x2E40DB1E0L'
>> >>> hex(-468864544 & 0xffffffffL)
>> '0xE40DB1E0L'
>>
>> >>> hex(12416037344 & 0xffffffffL)
>> '0xE40DB1E0L'
>> >>> hex(-468864544 & 0xffffffffL)
>> '0xE40DB1E0L'

[Qiangning Hong]
> Is this relationship (same low 32 bits) guaranteed?

No.  Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.

> Will it change in the future version?

That's possible, but not planned.  Note that the guts of string hashing
in CPython today is implemented via

    while (--len >= 0)
        x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
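If you do decide to rely on the low 32 bits anyway, the masking trick
is one line (my sketch; note it only smooths over 32- versus 64-bit
platform differences, not changes across Python versions):

    def hash32(s):
        # Keep only the low 32 bits of the platform hash, as a non-negative long.
        return hash(s) & 0xffffffffL

    print hex(hash32("some string"))

-- http://mail.python.org/mailman/listinfo/python-list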
Re: random shuffles
[Boris Borcic]
> x.sort(cmp = lambda x,y : cmp(random.random(),0.5))
>
> pick a random shuffle of x with uniform distribution ?

Say len(x) == N.  With Python's current sort, the conjecture is true if
and only if N <= 2.

> Intuitively, assuming list.sort() does a minimal number of comparisons
> to achieve the sort,

No idea what that could mean in a rigorous, algorithm-independent way.

> I'd say the answer is yes.  But I don't feel quite confortable with the
> intuition... can anyone think of a more solid argumentation ?

If a list is already sorted, or reverse-sorted, the minimum number of
comparisons needed to determine that is N-1, and Python's current sort
achieves that.  It first looks for the longest ascending or descending
run.  If comparison outcomes are random, it will decide "yes, already
sorted" with probability 1/2**(N-1), and likewise for reverse-sorted.
When N > 2, those are larger than the "correct" probabilities 1/(N!).

So, e.g., when N=3, the sort will leave the list alone 1/4th of the
time (and will reverse the list in-place another 1/4th).  That makes
the identity and reversal permutations much more likely than "they
should be", and the bias is worse as N increases.

Of course random.shuffle(x) is the intended way to effect a permutation
uniformly at random.
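A quick empirical check makes the bias easy to see (my sketch, not part
of the original reply; the counts will wobble from run to run):

    import random

    def sort_shuffle(seq):
        x = list(seq)
        x.sort(cmp=lambda a, b: cmp(random.random(), 0.5))
        return tuple(x)

    def real_shuffle(seq):
        x = list(seq)
        random.shuffle(x)
        return tuple(x)

    for shuffler in sort_shuffle, real_shuffle:
        counts = {}
        for _ in xrange(60000):
            perm = shuffler(range(3))
            counts[perm] = counts.get(perm, 0) + 1
        print shuffler.__name__, sorted(counts.items())

    # Expect sort_shuffle to hand back (0, 1, 2) and (2, 1, 0) about a
    # quarter of the time each, while real_shuffle stays near 10000 per
    # permutation.

-- http://mail.python.org/mailman/listinfo/python-list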