Re: random writing access to a file in Python

2006-08-25 Thread Tim Peters
[Claudio Grondi]
> I have a 250 Gbyte file (occupies the whole hard drive space)

Then where is Python stored ;-)?

> and want to change only eight bytes in this file at a given offset of appr. 
> 200
> Gbyte (all other data in that file should remain unchanged).
>
> How can I do that in Python?

Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(appr. 200 Gbyte)
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()
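
For instance, a concrete sketch (the path, offset and replacement bytes here
are made up):

    f = open(r"D:\bigfile.dat", "r+b")
    f.seek(200 * 1024**3)     # a byte offset of 200 GiB
    f.write("12345678")       # exactly 8 bytes
    f.close()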

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets.  For example, Windows
using NTFS does.  Try it.  Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate.  Use as
recent a released version of Python as you can.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: time.clock() going backwards??

2006-08-25 Thread Tim Peters
[Giovanni Bajo]
>> I experimented something very strange, a few days ago. I was debugging an
>> application at a customer's site, and the problem turned out to be that
>> time.clock() was going "backwards", that is it was sometimes
>> (randomically) returning a floating point value which was "less than" the
>> value returned by the previous invokation. The computer was a pretty fast
>> one (P4 3Ghz I think, running Windows XP), and this happened only between
>> very close invokations of time.clock().

[Terry Reedy]
> I seem to remember this being mentioned before on the list, perhaps by Tim
> Peters.  Perhaps he will chime in.

No real need ;-)  BIOS or HAL bug on a multiprocessor (or maybe even
hyperthreaded) box is by far the most likely cause (and the only cause
ever identified for sure by people who followed up).  Python's C code
slinging QueryPerformanceCounter is exceedingly simple, and isn't a
suspect because of that.  It's on the edge of vague possibility that
Microsoft's compiler generates non-monotonic code for converting
64-bit integer to double:  i.e., Python's C code obviously relies on
that if i and j are __int64 satisfying i >= j, then (double)i >=
(double)j.  If that type-conversion code /is/ monotonic (which is
almost certainly so), then the only ways the C code can fail are if
the HW counter overflows (in which case time.clock() would "jump" a
huge amount), or if the sequence of values returned by
QueryPerformanceCounter is itself non-monotonic at times (which is
consistent with known BIOS/HAL bugs).
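
If you want to check whether a given box exhibits the problem, a quick sketch
(the iteration count is arbitrary):

    import time
    prev = time.clock()
    for i in xrange(1000000):
        now = time.clock()
        if now < prev:
            print "clock went backwards by", prev - now
        prev = now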
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Misleading error message when opening a file (on Windows XP SP 2)

2006-08-28 Thread Tim Peters
[Claudio Grondi]
> Here an example of what I mean
> (Python 2.4.2, IDLE 1.1.2, Windows XP SP2, NTFS file system, 80 GByte
> large file):
>
>  >>> f = file('veryBigFile.dat','r')
>  >>> f = file('veryBigFile.dat','r+')
>
> Traceback (most recent call last):
>File "", line 1, in -toplevel-
>  f = file('veryBigFile.dat','r+')
> IOError: [Errno 2] No such file or directory: 'veryBigFile.dat'
>
> Is it a BUG or a FEATURE?

Assuming the file exists and isn't read-only, I bet it's a Windows
bug, and that if you open in binary mode ("r+b") instead I bet it goes
away (this wouldn't be the first large-file text-mode Windows bug).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: naive misuse? (of PyThreadState_SetAsyncExc)

2006-08-28 Thread Tim Peters
[EMAIL PROTECTED]
>> The documentation for PyThreadState_SetAsyncExc says "To prevent naive
>> misuse, you must write your own C extension to call this". Anyone care
>> to list a few examples of such naive misuse?

[and again]
> No? I'll take that then as proof that it's impossible to misuse the
> function.

That's wise ;-)  Stopping a thread asynchronously is in /general/ a
dangerous thing to do, and for obvious reasons.  For example, perhaps
the victim thread is running in a library routine at the time the
asynch exception is raised, and getting forcibly ejected from the
normal control flow leaves a library-internal mutex locked forever.
Or perhaps a catch-all "finally:" clause in the library manages to
release the mutex, but leaves the internals in an inconsistent state.
Etc.  The same kinds of potential disasters account for why Java
deprecated its versions of this gimmick:

http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html

That said, you can invoke PyThreadState_SetAsyncExc() from Python
using the `ctypes` module (which will be included in 2.5, and is
available as an extension module for earlier Pythons).
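
For illustration only, a minimal sketch of such a call via ctypes (the target
thread id `tid` is assumed to come from your own bookkeeping, e.g. saved by
the thread itself):

    import ctypes

    def async_raise(tid, exc_class):
        # asynchronously raise exc_class in the thread whose id is tid
        ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(tid),
                                                   ctypes.py_object(exc_class))

All the warnings above still apply, of course.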
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
[Joachim Durchholz]
>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively
>>> dominated by the comparison, that could give a speedup of up to 100%
>>> (approximately - dropping the logN factor is almost irrelevant, what
>>> counts is losing that factor of 2).

[Gabriel Genellina]
>> In fact it's the other way - losing a factor of 2 is irrelevant,
>> O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]
> That's just a question of what you're interested in.
>
> If it's asymptotic behavior, then the O(logN) factor is a difference.
>
> If it's practical speed, a constant factor of 2 is far more relevant
> than any O(logN) factor.

Nope.  Even if you're thinking of base 10 logarithms, log(N, 10) > 2
for every N > 100.  Base 2 logarithms are actually most appropriate
here, and log(N, 2) > 2 for every N > 4.  So even if the "2" made
sense here (it doesn't -- see next paragraph), the log(N) term
dominates it for even relatively tiny values of N.

Now it so happens that current versions of Python (and Perl) use merge
sorts, where the worst-case number of comparisons is roughly N*log(N,
2), not Wikipedia's 2*N*log(N, 2) (BTW, the Wikipedia article
neglected to specify the base of its logarithms, but it's clearly
intended to be 2).  So that factor of 2 doesn't even exist in current
reality:  the factor of log(N, 2) is the /whole/ worst-case story
here, and, e.g., is near 10 when N is near 1000.  A factor of 10 is
nothing to sneeze at.

OTOH, current versions of Python (and Perl) also happen to use
"adaptive" merge sorts, which can do as few as N-1 comparisons in the
best case (the input is already sorted, or reverse-sorted).  Now I'm
not sure why anyone would sort a list already known to be sorted, but
if you do, DSU is a waste of time.  In any other case, it probably
wins, and usually wins big, and solely by saving a factor of (up to)
log(N, 2) key computations.
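
For reference, the DSU pattern itself is just this (a sketch; `len` here
stands in for whatever expensive key-extraction function you have in mind):

    seq = ["kiwi", "fig", "banana"]
    key = len
    decorated = [(key(x), i, x) for i, x in enumerate(seq)]  # N key computations
    decorated.sort()
    result = [x for k, i, x in decorated]
    print result    # ['fig', 'kiwi', 'banana']

The index `i` is there so that ties on the key never fall through to comparing
the objects themselves.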

> (I'm not on the list, so I won't see responses unless specifically CC'd.)

Done.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
 [Joachim Durchholz]
>>>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively
>>>>> dominated by the comparison, that could give a speedup of up to 100%
>>>>> (approximately - dropping the logN factor is almost irrelevant, what
>>>>> counts is losing that factor of 2).

[Gabriel Genellina]
>>>> In fact it's the other way - losing a factor of 2 is irrelevant,
>>>> O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]
>>> That's just a question of what you're interested in.
>>>
>>> If it's asymptotic behavior, then the O(logN) factor is a difference.
>>>
>>> If it's practical speed, a constant factor of 2 is far more relevant
>>> than any O(logN) factor.

[Tim Peters]
>> Nope.  Even if you're thinking of base 10 logarithms, log(N, 10) > 2
>> for every N > 100.  Base 2 logarithms are actually most appropriate
>> here, and log(N, 2) > 2 for every N > 4.  So even if the "2" made
>> sense here (it doesn't -- see next paragraph), the log(N) term
>> dominates it for even relatively tiny values of N.

[Joachim Durchholz]
> Whether this argument is relevant depends on the constant factors
> associated with each term.

I'm afraid you're still missing the point in this example:  it's not
just that Python's (& Perl's) current sorts do O(N*log(N)) worst-case
comparisons, it's that they /do/ N*log(N, 2) worst-case comparisons.
O() notation isn't being used, and there is no "constant factor" here:
 the count of worst-case comparisons made is close to exactly N*log(N,
2), not to some mystery-constant times N*log(N, 2).  For example,
sorting a randomly permuted array with a million distinct elements
will require nearly 1000000*log(1000000, 2) ~= 1000000 * 20 = 20
million comparisons, and DSU will save about 19 million key
computations in this case.  O() arguments are irrelevant to this, and
the Wikipedia page you started from wasn't making an O() argument
either:

http://en.wikipedia.org/wiki/Schwartzian_transform

For an efficient ordinary sort function, the number of invocations of the
transform function goes from an average of 2nlogn to n;

No O() in sight, and no O() was intended there either.  You do exactly
N key computations when using DSU, while the hypothetical "efficient
ordinary sort function" the author had in mind does about 2*N*log(N,
2) key computations when not using DSU.  That's overly pessimistic for
Python's & Perl's current sort functions, where no more than N*log(N,
2) key computations are done when not using DSU.  The /factor/ of key
computations saved is thus as large as N*log(N, 2) / N = log(N, 2).
O() behavior has nothing to do with this result, and the factor of
log(N, 2) is as real as it gets.  If key extraction is at all
expensive, and N isn't trivially small, saving a factor of log(N, 2)
key extractions is /enormously/ helpful.
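
If you want to watch the factor directly, a small counting sketch (the data
and key function are made up):

    import random

    ncalls = [0]
    def keyfunc(s):
        ncalls[0] += 1
        return len(s)

    data = [str(random.random()) for i in range(1000)]

    ncalls[0] = 0
    sorted(data, key=keyfunc)      # DSU: exactly N key computations
    print ncalls[0]                # 1000

    ncalls[0] = 0
    sorted(data, cmp=lambda a, b: cmp(keyfunc(a), keyfunc(b)))
    print ncalls[0]                # many thousands -- about 2 per comparison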

If this is sinking in now, reread the rest of my last reply that got
snipped here.

> Roughly speaking, if the constant factor on the O(N) term is 100 and the
> constant factor on the O(logN) term is 1, then it's still irrelevant.

As above, it's talking about O() that's actually irrelevant in this
specific case.

> My point actually is this: big-Oh measures are fine for comparing
> algorithms in general, but when it comes to optimizing concrete
> implementations, its value greatly diminishes: you still have to
> investigate the constant factors and all the details that the big-Oh
> notation abstracts away.

That's true, although the argument here wasn't actually abstracting
away anything.  You've been adding abstraction to an argument that
didn't have any ;-)

> From that point of view, it's irrelevant whether some part of the
> algorithm contributes an O(1) or an O(logN) factor: the decision where
> to optimize is almost entirely dominated by the constant factors.

While true in some cases, it's irrelevant to this specific case.
More, in practice a factor of O(log(N)) is almost always more
important "for real" than a factor of O(1) anyway -- theoretical
algorithms hiding gigantic constant factors in O() notation are very
rarely used in real life.  For example, the best-known algorithm for
finding the number of primes <= x has O(sqrt(x)) time complexity, but
AFAIK has /never/ been implemented because the constant factor is
believed to be gigantic indeed.  Theoretical CompSci is full of
results like that, but they have little bearing on practical
programming.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
[/T]
>> OTOH, current versions of Python (and Perl)

[/F]
> just curious, but all this use of (& Perl) mean that the Perl folks have
> implemented timsort ?

A remarkable case of independent harmonic convergence:

http://mail.python.org/pipermail/python-dev/2002-July/026946.html

Come to think of it, I don't actually know whether a /released/ Perl
ever contained the development code I saw.  Looks like it got added to
Perl 5.8:

http://perldoc.perl.org/sort.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: GC and security

2006-08-30 Thread Tim Peters
[Aahz]
>> Assuming you're talking about CPython, strings don't really participate
>> in garbage collection.  Keep in mind that the primary mechanism for
>> reaping memory is reference counting, and generally as soon as the
>> refcount for an object goes to zero, it gets deleted from memory.

[Les Schaffer]
> ok so far ...

Not really ;-)  Refcounting is /a/ "garbage collection" technique, and
drawing a distinction between refcount-based reclamation and other
ways CPython reclaims memory has no bearing on your question.

>> Garbage collection only gets used for objects that refer to other
>> objects, so it would only apply if string refcounts are being held by
>> GC-able objects.

> you lost me by here ...

That tends to happen when an irrelevant distinction gets made ;-)

> is there something different about string objects than other objects in
> Python?

Not anything relevant wrt garbage collection.  It may be relevant that
strings are immutable, since that prevents you from overwriting a
string's contents yourself.

> are you saying EVERY string in Python stays in memory for the lifetime
> of the app?

He wasn't, and they don't.

>> Also keep in mind, of course, that deleting objects has nothing to do
>> with whether the memory gets overwritten...

> no chance of being overwritten until deleted, no?

True.

> and once deleted over time there is some probability of being
> overwritten, no?

True.  Also true if you add the intended "non-zero" qualifier to
"probability" ;-)

> and i am curious how that works.

Purely accidental -- nothing guaranteed -- details can (& do) change
across releases.  Read obmalloc.c for a tour of the accidents du jour.

> it sounds like you are saying once a string, always the same string, in 
> python.
> if thats true, i am glad i know that.

Not true, so be glad to forget it.

A curious possibility:  if you do a debug build of Python, obmalloc.c
arranges to overwrite all of an object's memory as soon as the object
is reclaimed (by any means, refcounting or otherwise).  That wasn't
for "security" (faux or otherwise), it's to make it easier to detect
buggy C code referencing freed memory.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-11 Thread Tim Peters
[Licheng Fang]
> ...
> I want to know if there is some way to make Python RE behave like grep
> does,

Not in general, no.  The matching strategies couldn't be more
different, and that's both deep and intentional.  See Friedl's book
for details:

http://regex.info/

> or do I have to change to another engine?

Yes, if POSIX regexp semantics are what you require.  Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching.  You may be more motivated to search longer ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-11 Thread Tim Peters
[Licheng Fang]
> ...
> In fact, what I'm doing is handle a lot of regular expressions. I
> wanted to build VERY LONG regexps part by part and put them all into a
> file for easy modification and maintenance. The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the syntactically wrong (* and ) is something to be
> replaced, and then I wrote a little script to do the string
> substitution, to get very long regexps to be compiled. I thought that
> might be a way to handle long and complex regexps, and in this course I
> encountered the problem with the semantics of '|'.
>
> My approach may sound desperate and funny, but please, do you have any
> good idea as to how to handle long and complex regexps?

My best advice is to find a different approach entirely.  For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.
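
A rough sketch of that split (the token pattern and grammar names are made
up):

    import re

    # regexps do only gross tokenization ...
    token_pat = re.compile(r'\d+|[./]|\S')

    def tokenize(s):
        return token_pat.findall(s)

    print tokenize("3.14/2.7")   # ['3', '.', '14', '/', '2', '.', '7']

    # ... and a grammar layer decides structure, e.g.
    #   NUMERAL  := FRACTION | DECIMAL | INT
    #   FRACTION := DECIMAL "/" DECIMAL
    #   DECIMAL  := INT "." INT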

Regexps are a brittle tool, best tolerated in small doses.  For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with).  OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).

Just sounds to me like the wrong tool for the job.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Licheng Fang]
>> Oh, please do have a look at the second link I've posted. There's a
>> table comparing the regexp engines. The engines you've tested probably
>> all use an NFA implementation.

[Bryan Olson]
> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book

Strongly disagree:  it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented.  It's not at
all about "regular expressions" in the  CompSci sense of the term,
which appears to be your complaint.

> was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.

As far as I could tell at the time, he originated it.  I'm not happy
about that either.

> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.

And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.

> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations:  Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems.  None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Licheng Fang]
>> Basically, the problem is this:
>>
>> >>> p = re.compile("do|dolittle")
>> >>> p.match("dolittle").group()
>> 'do'

...

>> The Python regular expression engine doesn't exhaust all the
>> possibilities, but in my application I hope to get the longest possible
>> match, starting from a given point.
>>
>> Is there a way to do this in Python?

[Bryan Olson]
> Yes. Here's a way, but it sucks real bad:
>
>
>  def longest_match(re_string, text):
>      regexp = re.compile('(?:' + re_string + ')$')
>      while text:
>          m = regexp.match(text)
>          if m:
>              return m
>          text = text[:-1]
>      return None

If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.
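
For example, a sketch of the same loop using the pos and endpos arguments of a
compiled pattern's match() method instead of slicing:

    import re

    def longest_match(re_string, text):
        regexp = re.compile('(?:' + re_string + ')$')
        for end in xrange(len(text), 0, -1):
            m = regexp.match(text, 0, end)   # matches against text[0:end], no copying
            if m:
                return m
        return None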
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Bryan Olson]
>>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
>>> book

[Tim Peters]
>> Strongly disagree: [...]

[Bryan]
> I know I'm disagreeing with a lot of smart people in panning
> the book.

That's allowed :-)

>>> What Python uses is search-and-backtrack. Unfortunately such
>>> engines don't have much theory behind them, and it's hard to
>>> reason generally about what they do.

>> Yup X 3, and the last is precisely why Friedl's book is valuable for
>> people using "NFA" implementations:  Friedl does a good job of
>> explaining when and why you're likely to trigger atrocious runtime
>> performance, and gives practical general techniques for avoiding those
>> problems.  None of that has anything to do with CompSci regexps
>> either, but is vital knowledge for people hoping to make happy
>> non-trivial use of Python/Perl/etc regexps.

> I read the first edition, minus some engine-specific stuff.
> General techniques are what I want, and I didn't see them in the
> book. He had things to do and not do, but it didn't add up to
> building (ir-)regular expressions with reliably tractable
> behavior.

My guess then is that the book moved at too slow a pace for you, and
you got bored.  The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint.  Under most "NFA" implementations, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal:  double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character:  not a
double quote, backslash, or newline.  The "special case" is a
backslash followed by any character whatsoever.  The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end)  fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence.  For
/why/ that's so, read the rest of his book ;-)  In effect, it gives
the search engine only one possible course of action at each character
in the target string.  So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part.  The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
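
For instance, a quick check of that regexp (whitespace removed here rather
than compiling with re.VERBOSE):

    import re
    c_string = re.compile(r'"[^"\\\n]*(?:\\[\x00-\xff][^"\\\n]*)*"')
    print c_string.search(r'x = "a\"b\\c";').group()   # prints "a\"b\\c"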

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense".  It's a set of obscure & specialized learned skills.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: min() & max() vs sorted()

2006-09-14 Thread Tim Peters
[MRAB]
> Some time after reading about Python 2.5 and how the built-in functions
> 'min' and 'max' will be getting a new 'key' argument, I wondered how
> they would treat those cases where the keys were the same, for example:
>
> L = ["four", "five"]
> print min(L, key = len), max(L, key = len)
>
> The result is:
>
> ('four', 'four')

min() and max() both work left-to-right, and return the minimal or
maximal element at the smallest index.

> I would've thought that min(...) should return the same as
> sorted(...)[0] (which it does)

It does, but only because Python's sort is stable, so that minimal
elements retain their original relative order.  That implies that the
minimal element with smallest original index will end up at index 0
after sorting.

> and that max(...) should return the same as sorted(...)[-1] (which it 
> doesn't).

Right -- although I don't know why you'd expect that.
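
To make the asymmetry concrete (a quick interactive sketch):

    >>> L = ["four", "five"]
    >>> min(L, key=len), sorted(L, key=len)[0]
    ('four', 'four')
    >>> max(L, key=len), sorted(L, key=len)[-1]
    ('four', 'five')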

> I think that's just down to a subtlety in the way that 'max' is written.

It's straightforward, skipping error cases:

def max(*args):
    is_first_arg = True
    for arg in args:
        if is_first_arg:
            winner = arg
            is_first_arg = False
        elif arg > winner:  # only difference in min() is "<" here instead
            winner = arg
    return winner
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how do you convert and array of doubles into floats?

2006-09-15 Thread Tim Peters
[Marc 'BlackJack' Rintsch]
>> What about:
>>
>> b = array.array('f', a)

[Diez B. Roggisch]
> AFAIK d and f are synonym for arrays, as python doesn't distinguish
> between these two on a type-level. And double it is in the end.

While Python has no type of its own corresponding to the native C
`float`, the `array` and `struct` modules do understand the native C
`float` .  A Python float (same as a C `double`) gets converted to a C
`float` when stored into one of those, and a C `float` is converted to
a Python float (C `double`) when a value is extracted.

>>> from array import array
>>> x = 1.01
>>> x
1.01
>>> array('d', [x])[0]   # full precision is preserved
1.01
>>> array('d', [x])[0] == x
True
>>> array('f', [x])[0]   # trailing bits are lost
1.0099999904632568
>>> array('f', [x])[0] == x
False
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: min() & max() vs sorted()

2006-09-15 Thread Tim Peters
[MRAB]
>>> Some time after reading about Python 2.5 and how the built-in functions
>>> 'min' and 'max' will be getting a new 'key' argument, I wondered how
>>> they would treat those cases where the keys were the same, for example:
>>>
>>> L = ["four", "five"]
>>> print min(L, key = len), max(L, key = len)
>>>
>>> The result is:
>>>
>>> ('four', 'four')

[Tim Peters]
>> min() and max() both work left-to-right, and return the minimal or
>> maximal element at the smallest index.

[MRAB]
> It doesn't say that in the documentation.

Right, the /language/ doesn't define anything about which specific
minimal or maximal element is returned.  Since you specifically
mentioned Python 2.5, I figured you were asking about CPython -- which
has always behaved in the way I explained.

>>> I would've thought that min(...) should return the same as
>>> sorted(...)[0] (which it does)

>> It does, but only because Python's sort is stable, so that minimal
>> elements retain their original relative order.  That implies that the
>> minimal element with smallest original index will end up at index 0
>> after sorting.

>>> and that max(...) should return the same as sorted(...)[-1] (which
>>> it doesn't).

>> Right -- although I don't know why you'd expect that.

> Strings have index(), find(), etc which work left-to-right and
> rindex(), rfind(), etc which work right-to-left.
>
> Lists have index() but not rindex().
>
> I just thought that if min() and max() work left-to-right then for
> completeness there should also be rmin() and rmax(); alternatively,
> min() should return sorted()[0] and max() should return sorted()[-1]
> for symmetry (my personal preference).

If you were to make either of those a feature request, I don't expect
they'd gain traction -- I expect "who cares?" would be the common
challenge, and "I do" wouldn't silence it ;-)  Compelling use cases
sometimes work to get a new feature, but "completeness" or "symmetry"
almost never do on their own (they function more as sanity checks on
proposed solutions to use cases).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-16 Thread Tim Peters
[Tim Peters]
>> [...]  The most valuable general technique [Friedl] (eventually ;-)
>> explained he called "unrolling", and consists of writing a regexp in
>> the form:
>>
>>normal* (?: special normal* )*
>>
>> where the sets of characters with which `normal` and `special` can
>> start are disjoint.  Under most "NFA" implementations, such a regexp is
>> immune to most bad behaviors, whether finding very long matches, or in
>> searching a very long string that happens not to contain any matches.

[Bryan Olson]
> So how general is that?

Between 6 and 7 ;-)

> The books just says "certain common expressions". I think the real
> answer is that one can unroll exactly the true regular expressions.

I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).

> The unrolling is essentially resolving the states of a DFA by hand.

Oh yes, for those who've studied the theoretical underpinnings.  Most
people have not.  Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works.  And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the  search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.

That explanation may not work for you, but it's effective for people
who haven't studied the theory here.  Such explanations also extend
naturally to gimmicks like backreferences,  which have no counterpart
in CompSci regexps.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hardlinks on NTFS

2006-09-17 Thread Tim Peters
[Wildemar Wildenburger]
>> I'm thinking of letting my program create hardlinks (or symlinks). I
>> know python allows doing this for ext, reiser and the like, but
>> apparently not for ntfs systems.
>> Is there any package out there that lets me create links in a platform
>> independent way?

[Calvin Spealman]
> Why isn't NTFS handled the same as any other hardlink supporting
> filesystem? Is it a choice or a bug?

None of the above.  It's not the filesystem, it's what the platform C
supports.  POSIX specifies a small pile of standard routines for
manipulating hard and symbolic links, and Python exposes those /when/
they're available.  Microsoft's C supplies none of them.

So it's a question of someone on Windows wanting this enough to
volunteer to write code, tests, and docs, and volunteer to maintain, a
pile of Microsoft-specific code to provide workalike functionality on
Windows.
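
Until someone does, one possible stopgap -- not part of the stdlib, just a
sketch using ctypes to call the Win32 API directly (Windows 2000 or later):

    import ctypes

    def hardlink(existing_path, link_path):
        # CreateHardLinkA(new_link_name, existing_file_name, security_attrs)
        ok = ctypes.windll.kernel32.CreateHardLinkA(link_path, existing_path, None)
        if not ok:
            raise OSError("CreateHardLink failed for %r" % link_path)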
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Installing 2.5 with 2.4?

2006-09-20 Thread Tim Peters
[Duncan Booth]
>> Windows is only smart enough to avoid duplicate entries if you tell it
>> to do that. e.g.
>>
>> PATH c:\python25;c:\python25\scripts;%PATH:c:\python25;c:\python25\scripts;=%
>>
>> will add the two Python 2.5 folders to the head of the path without
>> duplicating them.

[John Machin]
> Wow .. I didn't know that! What's the syntax? Something like
> %variablename[:oldtext=[newtext]]%
> ?

Yup.

> Where is this documented?

From a DOS box (cmd.exe), enter

set /?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Maths error

2007-01-09 Thread Tim Peters
[Tim Peters]
...
>|> Well, just about any technical statement can be misleading if not
>|> qualified to such an extent that the only people who can still
>|> understand it knew it to begin with <0.8 wink>.  The most dubious
>|> statement here to my eyes is the intro's "exactness carries over
>|> into arithmetic".  It takes a world of additional words to explain
>|> exactly what it is about the example given (0.1 + 0.1 + 0.1 - 0.3 =
>|> 0 exactly in decimal fp, but not in binary fp) that does, and does
>|> not, generalize.  Roughly, it does generalize to one important
>|> real-life use-case:  adding and subtracting any number of decimal 
>|> quantities delivers the exact decimal result, /provided/ that
>|> precision is set high enough that no rounding occurs.

[Nick Maclaren]
> Precisely.  There is one other such statement, too: "Decimal numbers
> can be represented exactly."  What it MEANS is that numbers with a
> short representation in decimal can be represented exactly in decimal,
> which is tautologous, but many people READ it to say that numbers that
> they are interested in can be represented exactly in decimal.  Such as
> pi, sqrt(2), 1/3 and so on 

Huh.  I don't read it that way.  If it said "numbers can be ..." I 
might, but reading that way seems to require effort to overlook the 
"decimal" in "decimal numbers can be ...".

[attribution lost]
>|>> and how is decimal no better than binary?
 
>|> Basically, they both lose info when rounding does occur.  For
>|> example, 

> Yes, but there are two ways in which binary is superior.  Let's skip
> the superior 'smoothness', as being too arcane an issue for this
> group,

With 28 decimal digits used by default, few apps would care about this 
anyway.

> and deal with the other.  In binary, calculating the mid-point
> of two numbers (a very common operation) is guaranteed to be within
> the range defined by those numbers, or to over/under-flow.
>
> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range
> (x,y) in decimal, even for the most respectable values of x and y.
> This was a MAJOR "gotcha" in the days before binary became standard,
> and will clearly return with decimal.

I view this as being an instance of "lose info when rounding does 
occur".  For example,

>>> import decimal as d
>>> s = d.Decimal("." + "9" * d.getcontext().prec)
>>> s
Decimal("0.9999999999999999999999999999")
>>> (s+s)/2
Decimal("1.000000000000000000000000000")
>>> s/2 + s/2
Decimal("1.000000000000000000000000000")

"The problems" there are due to rounding error:

>>> s/2  # "the problem" in s/2+s/2 is that s/2 rounds up to exactly 1/2
Decimal("0.5000000000000000000000000000")

>>> s+s # "the problem" in (s+s)/2 is that s+s rounds up to exactly 2
Decimal("2.000000000000000000000000000")

It's always something ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Maths error

2007-01-10 Thread Tim Peters
[Tim Peters]
...
>> Huh.  I don't read it that way.  If it said "numbers can be ..." I 
>> might, but reading that way seems to require effort to overlook the 
>> "decimal" in "decimal numbers can be ...".

[Nick Maclaren]
> I wouldn't expect YOU to read it that way,

Of course I meant "putting myself in others' shoes, I don't ...".

> but I can assure you from experience that many people do.

Sure.  Possibly even most.  Short of writing a long & gentle tutorial, 
can that be improved?  Alas, most people wouldn't read that either <0.5 
wink>.

> What it MEANS is "Numbers with a short representation in decimal

"short" is a red herring here:  Python's Decimal constructor ignores the 
precision setting, retaining all the digits you give.  For example, if 
you pass a string with a million decimal digits, you'll end up with a 
very fat Decimal instance -- no info is lost.

> can be represented exactly in decimal arithmetic", which is
> tautologous.  What they READ it to mean is "One advantage of
> representing numbers in decimal is that they can be represented
> exactly", and they then assume that also applies to pi, sqrt(2),
> 1/3 
>
> The point is that the "decimal" could apply equally well to the
> external or internal representation and, if you aren't fairly
> clued-up in this area, it is easy to choose the wrong one.

Worse, I expect most people have no real idea of that there's a possible 
difference between internal and external representations.  This is often 
given as a selling point for decimal arithmetic:  it's WYSIWYG in ways 
binary fp can't be (short of inventing power-of-2 fp representations for 
I/O, which few people would use).

[attribution lost]
>>>>> and how is decimal no better than binary?

[Tim]
>>>> Basically, they both lose info when rounding does occur.  For
>>>> example, 

[Nick]
>>> Yes, but there are two ways in which binary is superior.  Let's
>>> skip the superior 'smoothness', as being too arcane an issue for
>>> this group,

>> With 28 decimal digits used by default, few apps would care about
>> this anyway.

> Were you in the computer arithmetic area during the "base wars" of the
> 1960s and 1970s that culminated with binary winning out?

Yes, although I came in on the tail end of that and never actually used 
a non-binary machine.

> A lot of very well-respected numerical analysts said that larger bases
> led to a faster build-up of error (independent of the precision).  My
> limited investigations indicated that there was SOME truth in that,
> but it wasn't a major matter; I never saw the matter settled
> definitively.

My point was that 28 decimal digits of precision is far greater than 
supplied even by 64-bit binary floats today (let alone the smaller sizes 
in most-common use back in the 60s and 70s).  "Pollution" of low-order 
bits is far less of a real concern when there are some number of low-
order bits you don't care about at all.

>>> and deal with the other.  In binary, calculating the mid-point
>>> of two numbers (a very common operation) is guaranteed to be within
>>> the range defined by those numbers, or to over/under-flow.
>>>
>>> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range
>>> (x,y) in decimal, even for the most respectable values of x and y.
>>> This was a MAJOR "gotcha" in the days before binary became standard,
>>> and will clearly return with decimal.

>> I view this as being an instance of "lose info when rounding does 
>> occur".  For example,
 
> No, absolutely NOT!

Of course it is.  If there were no rounding errors, the computed result 
would be exactly right -- that's darned near tautological too.  You 
snipped the examples I gave showing exactly where and how rounding error 
created the problems in (x+y)/2 and x/2+y/2 for some specific values of 
x and y using decimal arithmetic.  If you don't like those examples, 
supply your own, and if you get a similarly surprising result you'll 
find rounding error(s) occur(s) in yours too.

It so happens that rounding errors in binary fp can't lead to the same 
counterintuitive /outcome/, essentially because x+x == y+y implies x == 
y in base 2 fp, which is indeed a bit of magic specific to base 2.  The 
fact that there /do/ exist fp x and y such that x != y yet x+x == y+y in 
bases > 2 is entirely due to fp rounding error losing info.
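
A quick binary counterpart of the decimal example above -- here the computed
midpoints stay put:

    >>> x = 1.0 - 2.0**-53      # the largest double less than 1.0
    >>> (x + x) / 2 == x
    True
    >>> x/2 + x/2 == x
    True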

> This is an orthogonal matter,

Disagree.

> and is about the loss of an important invariant when using any base
> above 2.

It is that.

> Back in the days when the

Re: Maths error

2007-01-13 Thread Tim Peters
[Nick Maclaren]
>> ...
>> Yes, but that wasn't their point.  It was that in (say) iterative
>> algorithms, the error builds up by a factor of the base at every
>> step. If it wasn't for the fact that errors build up, almost all
>> programs could ignore numerical analysis and still get reliable
>> answers! 
>>
>> Actually, my (limited) investigations indicated that such an error
>> build-up was extremely rare - I could achieve it only in VERY
>> artificial programs.  But I did find that the errors built up faster
>> for higher bases, so that a reasonable rule of thumb is that 28
>> digits with a decimal base was comparable to (say) 80 bits with a
>> binary base. 

[Hendrik van Rooyen]
> I would have thought that this sort of thing was a natural consequence 
> of rounding errors - if I round (or worse truncate) a binary, I can be 
> off by at most one, with an expectation of a half of a least 
> significant digit, while if I use hex digits, my expectation is around
> eight, and for decimal around five... 

Which, in all cases, is a half ULP at worst (when rounding -- as 
everyone does now).

> So it would seem natural that errors would propagate 
> faster on big base systems, AOTBE, but this may be 
> a naive view.. 

I don't know of any current support for this view.  In the bad old days, 
such things were often confused by architectures that mixed non-binary 
bases with "creative" rounding rules (like truncation indeed), and it 
could be hard to know where to "pin the blame".

What you will still see stated is variations on Kahan's telegraphic 
"binary is better than any other radix for error analysis (but not very 
much)", listed as one of two technical advantages for binary fp in:

http://www.cs.berkeley.edu/~wkahan/MktgMath.pdf

It's important to note that he says "error analysis", not "error 
propagation" -- regardless of base in use, rounding is good to <= 1/2 
ULP.  A fuller elementary explanation of this can be found in David 
Goldberg's widely available "What Every Computer Scientist Should Know 
About Floating-Point", in its "Relative Error and Ulps" section.  The 
short course is that rigorous forward error analysis of fp algorithms is 
usually framed in terms of relative error:  given a computed 
approximation x' to the mathematically exact result x, what's the 
largest possible absolute value of the mathematical

   r = (x'-x)/x

(the relative error of x')?  This framework gets used because it's more-
or-less tractable, starting by assuming inputs are exact (or not, in 
which case you start by bounding the inputs' relative errors), then 
successively computing relative errors for each step of the algorithm.  
Goldberg's paper, and Knuth volume 2, contain many introductory examples 
of rigorous analysis using this approach.

Analysis of relative error generally goes along independent of FP base.  
It's at the end, when you want to transform a statement about relative 
error into a statement about error as measured by ULPs (units in the 
last place), where the base comes in strongly.  As Goldberg explains, 
the larger the fp base the sloppier the relative-error-converted-to-ULPs 
bound is -- but this is by a constant factor independent of the 
algorithm being analyzed, hence Kahan's "... better ... but not very 
much".  In more words from Goldberg:

Since epsilon [a measure of relative error] can overestimate the
effect of rounding to the nearest floating-point number by the
wobble factor of B [the FP base, like 2 for binary or 10 for
decimal], error estimates of formulas will be tighter on machines
with a small B.

When only the order of magnitude of rounding error is of interest,
ulps and epsilon may be used interchangeably, since they differ by
at most a factor of B.

So that factor of B is irrelevant to most apps most of the time.  For a 
combination of an fp algorithm + set of inputs near the edge of giving 
gibberish results, of course it can be important.  Someone using 
Python's decimal implementation has an often very effective workaround 
then, short of writing a more robust fp algorithm:  just boost the 
precision.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: doctest problem with null byte

2007-01-25 Thread Tim Peters
[Stuart D. Gathman]
> I am trying to create a doctest test case for the following:
>
> def quote_value(s):
> """Quote the value for a key-value pair in Received-SPF header
> field if needed.  No quoting needed for a dot-atom value.
>
> >>> quote_value(r'abc\def')
> '"abcdef"'
> >>> quote_value('abc..def')
> '"abc..def"'
> >>> quote_value('')
> '""'
> >>> quote_value('-all\x00')
> '"-allx00"'
> ...
> 
> However, doctest does *not* like the null byte in the example (yes,
> this happens with real life input):
> **
> File "/home/stuart/pyspf/spf.py", line 1453, in spf.quote_value
> Failed example:
> quote_value('-all')
> Exception raised:
> Traceback (most recent call last):
>   File
>   "/var/tmp/python2.4-2.4.4c1-root/usr/lib/python2.4/doctest.py", 
>  line 1248, in __run
> compileflags, 1) in test.globs
> TypeError: compile() expected string without null bytes
> **
> 
> How can I construct a test cast for this?

As the docs say, doctest examples are parsed /twice/:  once when Python 
compiles the module and creates string objects, and again when doctest 
extracts and passes example substrings to the compile() function (which 
you can see in your traceback).

The first time, the \x00 in the

 quote_value('-all\x00')

portion of your docstring got changed into an honest-to-goodness NUL 
byte.  Passing that to compile() can't work.

To alleviate that, and the "leaning toothpick syndrome" in your other 
examples, a simple approach is to make your docstring a raw string 
(stick the letter 'r' in front of the opening triple-quote).

For example, this works fine:

def f():
r"""
>>> len("\\")
1
>>> ord("\x00")
0
"""

Because it's a raw string, Python does /not/ change the

\x00

portion into a NUL byte when it compile the module.  Instead the 
docstring continues to contain the 4-character substring:

\x00

and that's what compile() needs to see.

To perform the same test without using a raw string requires slamming in 
more backslashes (or other ad-hoc escaping tricks):

def g():
"""
>>> len("\\\\")
1
>>> ord("\\x00")
0
"""
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Signed zeros: is this a bug?

2007-03-11 Thread Tim Peters
[attribution lost]
...
>>> Yup: the workaround seems to be as simple as replacing all occurrences
>>> of -0.0 with -(0.0).  I'm embarrassed that I didn't figure this out
>>> sooner.
>>> 
>>> >>> x, y = -(0.0), 0.0
>>> >>> x, y
>>> (-0.0, 0.0)

[Alex Martelli]
>> Glad it works for you, but it's the kind of workaround that could
>> break with any minor tweak/optimization to the compiler...
>> very fragile:-(.

[also Alex]
> OTOH, Python 2.4 works just fine...:
>
> Python 2.4.3 (#1, Apr  7 2006, 10:54:33) 
> [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 0.0,-0.0
> (0.0, -0.0)
> >>> -0.0,0.0
> (-0.0, 0.0)
>
> so it seems to be very specifically a 2.5 problem.

It's a bug that keeps resurfacing, probably because there's no portable 
way to test that it stays fixed :-( (it's not an accident that the OP 
relied on atan2 to distinguish +0.0 from -0.0!  they act the same in 
/almost/ all contexts).

Python's grammar doesn't have negative numeric literals.  Instead

"-" CONSTANT_LITERAL

looks like unary minus applied to the non-negative CONSTANT_LITERAL.  
All early versions of Python worked that way too.

The first time the +/- 0.0 bug occurred is when the front end was 
changed to act as if

"-" CONSTANT_LITERAL

were a literal in its own right, and then that +0.0 == -0.0 caused the 
first instance of either in a compilation unit to be used as the value 
for all instances of both throughout the compilation unit.  That was 
fixed by refusing to apply the optimization if the value of 
CONSTANT_LITERAL was a float that compared equal to 0.0.

2.5 introduced a new front end and more ambitious constant-folding, and 
I expect the bug showed up again due to one of those.

Note:  printing the value of a float 0 may not reveal its sign.  IIRC, 
glibc's float-to-string routines do display the sign of 0, but 
Microsoft's do not.  However, atan2() is sensitive to the sign under 
both glibm and Microsoft's libm equivalent.
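
For reference, the atan2 trick for telling the zeros apart (output as shown
by a typical CPython 2.5; as noted, repr() may not show the sign):

    >>> from math import atan2
    >>> x, y = -(0.0), 0.0
    >>> atan2(x, -1.0), atan2(y, -1.0)   # -pi for -0.0, +pi for +0.0
    (-3.1415926535897931, 3.1415926535897931)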
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: math.pow(x,y)

2006-06-11 Thread Tim Peters
[Wojciech Muła]
>> You have to use operator **, i.e. 34564323**456356

Or the builtin pow() instead of math.pow().

[Gary Herron]
> That's not very practical. That computation will produce a value with
> more than 3.4 million digits.

Yes.

> (That is, log10(34564323)*456356 = 3440298.) Python will attempt this, but
> I was not patient enough to see if it could calculate an answer today (or even
> this week).

On my box it took less than 30 seconds to do

x = 34564323**456356

If you try to _display_ that as a decimal string, it will take
enormously longer.  Python uses a power-of-2 base internally, and
conversion to decimal takes time quadratic in the number of digits.
Doing y = hex(x) instead is very fast (small fraction of a second).
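
A rough timing sketch (absolute times will vary a lot across machines):

    import time
    t0 = time.clock()
    x = 34564323 ** 456356     # the full ~3.4-million-digit integer
    t1 = time.clock()
    h = hex(x)                 # power-of-2 base conversion: fast
    t2 = time.clock()
    print "pow %.1fs   hex %.1fs" % (t1 - t0, t2 - t1)
    # str(x) would take enormously longer -- quadratic in the number of digits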

> I doubt that you really *want* all 3.4 million digits. So what is it you
> really want? A scientific or engineering result as a floating point
> number accurate to some reasonable number of digits? That integer value
> modulo some other integer (as used in various cryptology schemes)?

For example, if you only want the last 6 digits, pow(34564323, 456356,
1000000) returns 986961 in an eyeblink.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Numerics, NaNs, IEEE 754 and C99

2006-06-15 Thread Tim Peters
[Nick Maclaren]
 Firstly, a FAR more common assumption is that integers wrap in twos'
 complement - Python does not do that.

[Grant Edwards]
>>> It used to

[Fredrik Lundh]
>> for integers ?  what version was that ?

[Grant]
> Am I remembering incorrectly?

Mostly but not entirely.

> Didn't the old fixed-width integers operate modulo-wordsize?

Not in Python.

>  I distinctly remember having to rewrite a bunch of checksum code when
> fixed-width integers went away.

Best guess is that you're working on a 32-bit box, and remember this
Python <= 2.2 behavior specific to the left shift operator:

>>> 1 << 31
-2147483648
>>> 1 << 32
0

On a 64-bit box (except Win64, which didn't exist at the time ;-)),
those returned 2**31 and 2**32 instead, while "1 << 64" wrapped to 0
(etc).

Python 2.3 started warning about that:

>>> 1 << 31
__main__:1: FutureWarning: x<<y losing bits or changing sign will return a long in Python 2.4 and up
-2147483648

and Python 2.4 went on to return unbounded longs:

>>> 1 << 31
2147483648L
>>> 1 << 32
4294967296L

+ - * / on short ints always complained about overflow before int-long
unification, although IIRC very early Pythons missed the overflow
error in (on a 32-bit box) int(-(2L**31))/-1.
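
If code really wants the old wrap-to-32-bits behavior (checksums are the
usual reason), the common idiom is to mask explicitly -- a sketch:

    def wrap32(x):
        x &= 0xFFFFFFFFL           # keep the low 32 bits
        if x >= 0x80000000L:       # map back to a signed 32-bit value
            x -= 0x100000000L
        return x

    print wrap32(1 << 31)   # -2147483648
    print wrap32(1 << 32)   # 0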
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PyObject_SetItem(..) *always* requires a Py_INCREF or not?

2006-06-17 Thread Tim Peters
[EMAIL PROTECTED]
> I would think everytime you add an item to a list you must increase
> reference count of that item.

_Someone_ needs to.  When the function called to add the item does the
incref itself, then it would be wrong for the caller to also incref
the item.

> http://docs.python.org/api/refcountDetails.html has an example that
> seems to contradict that
>
> int
> set_all(PyObject *target, PyObject *item)
> {
> int i, n;
>
> n = PyObject_Length(target);
> if (n < 0)
> return -1;
> for (i = 0; i < n; i++) {
> if (PyObject_SetItem(target, i, item) < 0)
> return -1;
> }
> return 0;
> }
>
> *WHY* don't you need a Py_INCREF(item); in the for loop!?!?!?

You should take a break, and read that section again later ;-)

The _point_ of that example is in fact to illustrate that you don't
need to incref when calling PyObject_SetItem().  While I can't know,
I'm guessing that you're still "seeing" PyList_SetItem() here, which
has radically different behavior.  PyList_SetItem() steals a reference
to its second argument, but PyObject_SetItem() does not.  Read the
whole section again from its start, and this should be much clearer
the second time through.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need helping tracking down weird bug in cPickle

2006-06-20 Thread Tim Peters
[Carl J. Van Arsdall]
> Hey everyone, cPickle is raising an ImportError that I just don't quite
> understand.

When that happens, the overwhelmingly most likely cause is that the
set of modules on your PYTHONPATH has changed since the pickle was
first created, in ways such that a module _referenced_ inside the
pickle can no longer be found.

For example, here's file resourcepmdb.py on my box:

def f():
print "hi"

Here's a bit of Python code that uses it:

import cPickle

import resourcepmdb

pf = open("junk.pik", "wb")
pickler = cPickle.Pickler(pf)
pickler.dump([resourcepmdb.f])
pf.close()

Pickle stores module globals like user-defined classes and module
functions (the case above) not by saving the code they contain, but
_just_ by saving strings recording the module's name and the global's
name.  The unpickling process creates an import out of those strings.

So long as resourcepmdb stays on my PYTHONPATH, unpickling junk.pik
works fine.  For example, this code:

import cPickle

pf = open("junk.pik", "rb")
pickler = cPickle.Unpickler(pf)
whatever = pickler.load()
pf.close()
print whatever

prints something like

[<function f at 0x...>]

But if resourcepmdb gets off my path (for example, resourcepmdb.py
gets deleted, renamed, or moved -- or PYTHONPATH itself changes), that
same code dies with exactly the error you reported:

Traceback (most recent call last):
  File "pik2.py", line 5, in ?
whatever = pickler.load()
ImportError: No module named resourcepmdb

> ...
> I've done some research and it looks like cPickle tries to load some
> modules as some kind of test.

That's unlikely ;-)

> From what I can tell the module that cPickle is trying to load is a
> concatenation of the module that it exists in and part of the string
> that was sent in the orignal request.

Sorry, can't tell from here.

> It looks like something where memory is getting overwritten, but I'm not
> really doing anything too fancy, so i'm not sure how to approach this
> problem.  I'm using python2.2.  Does anyone know how I might go about
> trying to troubleshoot something like this?  Is there a good tool or a
> better method of error handling I can use to try and track this down?
> Or is it better to give up troubleshooting and just upgrade to 2.4?

You could at least _install_ a recent Python so you could use the
newer pickletools module.  pickletools.dis() displays a readable
"disassembly" of a pickle, so that there's no need to guess about what
the pickle _thinks_ it's trying to do.  pickletools won't have any
problem working with pickles created by an older Python.
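
For instance, something along these lines (the exact disassembly varies with
pickle protocol):

    import pickletools

    pf = open("junk.pik", "rb")
    pickletools.dis(pf.read())   # shows a GLOBAL opcode naming 'resourcepmdb f',
    pf.close()                   # i.e. the module.global the unpickler will import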

> ...
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "/home/build/bin/resourceManager/resourceQueue.py", line 50, in
> initQueue
> pickledList = cPickle.load(queueFPtr)
> ImportError: No module named resourcepmdb'

See above for the most obvious way to get an error like that.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-20 Thread Tim Peters
[MTD]
> I've been messing about for fun creating a trial division factorizing
> function and I'm naturally interested in optimising it as much as
> possible.
>
> I've been told that iteration in python is generally more
> time-efficient than recursion. Is that true?

Since you heard it from me to begin with, I suppose saying "yes" again
won't really help ;-)  You can use time.time() or time.clock(), or the
`timeit` module, to measure speed.  Try timing a dead-simple factorial
functions both ways.

BTW, I've worked on Python's implementation for close to 15 years now,
so when I say something about Python, it's _usually_ safe to just
believe it :-)  Questioning is fine, but in cases like this where it's
easy to find out for yourself, I probably won't make time to respond.

> Here is my algorithm as it stands. Any suggestions appreciated!
>
> from math import *
>
> def factorize(x):
> """ Return factors of x """
> factors = []
> lim = int(sqrt(x))

Here you're limited to integers for which a floating sqrt is accurate.
 That's 53 bits on most platforms.  For integers larger than that, the
code may produce incorrect results in mysterious ways at times
(because the square root may be inaccurate).

> y = divmod(x,2)

Most people would use "unpacking assignment" here instead, like

q, r = divmod(x, 2)

Then later code gets to use meaningful names instead of messing with
mysterious little integer subscripts.  It's also quicker to use local
names than to build subscripting expressions.

> if y[1] == 0: # x is even
> factors = factors + [2] + factorize(y[0])

Since `factors` upon entry to this block must be [], it's kinda
contorted.  More obvious (if you _have_ to use recursion) would be:

return [2] + factorize(y[0])

and then unindent the rest of the code a level.

> else:   # x is odd
> i = 3
> while i <= lim:
> y = divmod(x,i)

Again more commonly written:

q, r = divmod(x, i)

> if y[1] == 0:
> factors = factors + [i] + factorize(y[0])

It hardly matters in this context, but appending to a list is
_generally_ much faster than building a brand new list from scratch.

At a higher level, your recursions always start from 2 again; e.g., if
i happens to be 434349 here, the factorize(y[0]) call starts over from
2, despite that you already know that no divisor less than 434349 can
succeed.  To do this _efficiently_ with recursion requires passing
more knowledge across recursion levels.

> i = lim+1
> else:
> i += 2
>
> if factors == []: # necessary step for correct recursion
> factors = [x]
>
> return factors

Recursion is pretty bizarre here.  Not _wrong_, just "pretty bizarre",
and is leading you into higher-level inefficiences.  There are all
sorts of ways to repair that, but I'm not going to talk about them
because it's easy to write an iterative algorithm instead.

Here's one way to do that, avoiding float sqrt (this works correctly
for integers of any size, although trial division is hopelessly
_inefficient_ for finding "large" factors), and skipping multiples of
3 in the inner loop too:

def trial(n):
if n <= 1:
return [n]
factors = []
for tiny in 2, 3:
while n % tiny == 0:
factors.append(tiny)
n //= tiny
delta = 2
d = 5
while 1:
q, r = divmod(n, d)
if r:
if q < d:
break
d += delta
delta = 6 - delta
else:
factors.append(d)
n = q
if n > 1:
factors.append(n)
return factors

If we call the original input N, the key invariants holding at the top
of the loop are:

1. N >= 2, and N == n * the product of the integers in `factors`
2. n is not divisible by any prime < d
3. all elements in `factors` are prime

It may take a bit of thought to see that d marches over all integers
>= 5 that aren't divisible by either 2 or 3:  5, 7, 11, 13, 17, 19,
23, 25, 29, 31, ...
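
A quick sanity check of that claim, for the skeptical:

d, delta = 5, 2
seq = []
while d < 32:
    seq.append(d)
    d += delta
    delta = 6 - delta
print seq   # prints [5, 7, 11, 13, 17, 19, 23, 25, 29, 31]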

It's more-or-less generally true that establishing and proving loop
invariants in an iterative algorithm is "the same thing" as
establishing and proving pre- and post-conditions in a recursive
algorithm.  While it may feel strained at first, it's very much
worthwhile learning how to do that.  For example, from #2 it's easy to
prove that if q < d, n must either be 1 or a prime.  That's what
justifies the "break" statement inside the loop.  Then since that's
the only way to get out of the loop, n is either 1 or a prime when the
loop finishes, and that explains the rest of the code.
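
For the record, a quick interactive check of trial() might look like
this (output reconstructed for illustration, not from the original
post):

>>> trial(360)
[2, 2, 2, 3, 3, 5]
>>> trial(600851475143)
[71, 839, 1471, 6857]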
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python SHA-1 as a method for unique file identification ? [help!]

2006-06-21 Thread Tim Peters
[EP <[EMAIL PROTECTED]>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script

It's your script.  Always open binary files in binary mode.  It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility despite that the original reason for it went
away _long_ ago).

...

> I am trying to reduce duplicate files in storage at home - I have a
> large number files (e.g. MP3s) which have been stored on disk multiple
> times under different names or on different paths.

...

> All seemed to be working until I examined my log files and found files
> with the same SHA digest had different sizes according to
> os.stat(fpath).st_size .  This is on Windows XP.
>
> -  Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

> -  Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted ;-)

...
> def hashit(pth):
> """Takes a file path and returns a SHA hash of its string"""
> fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.
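
A minimal rewrite along those lines (this uses hashlib, available in
Python 2.5+; the older sha module's sha.new() has the same update() /
hexdigest() interface), reading in chunks so a big MP3 doesn't have to
fit in memory:

import hashlib

def hashit(pth, chunksize=1 << 16):
    """Return the hex SHA-1 digest of the file at pth."""
    h = hashlib.sha1()
    f = open(pth, 'rb')          # binary mode -- no text-mode surprises
    try:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            h.update(chunk)
    finally:
        f.close()
    return h.hexdigest()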

Do note that there are faster ways to go about this.  For example, if
a particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).
 Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc.  If duplicates are uncommon, this
can be a huge savings.  For example, here's output from an old
dup-finding Python program of mine run over a directory tree which
happens to contain no duplicates:

Files
-----
Total               10,718
Unique              10,718
Duplicate                0
# w/ unique size    10,053
# w/ unique prefix     665

Bytes
-----
Total        1,401,668,015
Unique       1,401,668,015
Duplicate                0
Read                76,688
Excess              76,688

Those last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.
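
The size-grouping idea itself is only a few lines -- a sketch, not the
program that produced the numbers above:

import os

def group_by_size(paths):
    """Map size -> list of paths; only multi-file groups can hide dups."""
    groups = {}
    for p in paths:
        groups.setdefault(os.path.getsize(p), []).append(p)
    return dict((size, ps) for size, ps in groups.iteritems()
                if len(ps) > 1)

Files whose sizes are unique never need to be opened at all; only the
groups returned here call for byte-by-byte (or hash) comparison.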

No hashes are computed by that program.  I'd use a hash if instead I
were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?"
could be done by computing its hash.  While the program above
generally reads "amazingly" few file bytes, it hammers the OS file
directory services, and it would go faster to compute a hash of a new
file than to run the whole analysis again.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-21 Thread Tim Peters
[Kay Schluehr]
> You might use a separate prime generator to produce prime factors. The
> factorize algorithm becomes quite simple and configurable by prime
> generators.

Alas, yours was _so_ simple that it always takes time proportional to
the largest prime factor of n (which may be n) instead of worst-case
time proportional to sqrt(n).  Repair that, and it's not quite so
simple anymore.  The speed of the sieve generator would also benefit a
lot by never looking at even integers, beyond an initial "yield 2" ...
the OP said he was looking for speed more than elegance, and they're
not good buddies here ;-)

> def eratosthenes():
> memo = {}
> q = 2
> while True:
> p = memo.pop(q, None)
> if p is None:
> yield q
> memo[q*q] = q
> else:
> x = p + q
> while x in memo:
> x += p
> memo[x] = p
> q+=1
>
> def factorize(n, sieve = eratosthenes):
> if n <= 1:
> return [n]
> factors = []
> primes = sieve()
> for q in primes:
> while n % q == 0:
> factors.append(q)
> n //= q
> if n == 1:
> return factors

At _some_ point you might think that's bound to be faster than only
skipping multiples of 2 and 3, because it does fewer divisions.  But
to get to a particular prime p, the sieve there has to go around its
outer loop about 3x as often as the trial() function goes around its
inner loop, and grows a dict with about p/ln(p) entries (while the
trial() function's memory use is constant), and those aren't free.

Applied to 991**2 (constructed so that both functions take time
proportional to sqrt(n)), on my box the factorize() function took 4x
longer than trial() (approximately 16 seconds versus 4).  Trial
division isn't practical for such large inputs, but I picked the
square of a "large" prime to give factorize() maximum advantage in
skipping division attempts (by the time we get to a prime p,
factorize() tries about p/ln(p) divisions, while trial() does about
p/3; so trial() does about ln(p)/3 times as many divisions as
factorize():  the larger the p we need, the larger that _factor_
gets).

Alas, it appears that made the dict so big that cache faults on dict
accesses more than wiped out the division advantage.  At the smaller
83**2, factorize() took only about 3.6x longer, despite losing
some of its relative division advantage.

In short, it's never what you think it is ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: random.jumpahead: How to jump ahead exactly N steps?

2006-06-21 Thread Tim Peters
[Matthew Wilson]
> The random.jumpahead documentation says this:
>
> Changed in version 2.3: Instead of jumping to a specific state, n steps
> ahead, jumpahead(n) jumps to another state likely to be separated by
> many steps..
>
> I really want a way to get to the Nth value in a random series started
> with a particular seed.  Is there any way to quickly do what jumpahead
> apparently used to do?

No known way, and it seems unlikely that any quick way will be found
in my lifetime ;-) for the Mersenne Twister.  In Pythons <= 2.2, the
underlying generator was the algebraically _very_ much simpler
original Wichman-Hill generator, and jumping ahead by exactly n steps
was just a matter of a few modular integer exponentiations.  That took
time proportional to log(n), so was extremely fast.

It was also much more _necessary_ using W-H, since W-H's period was
only around 10**13, while the Twister's period is around 10**6001:
even if you're going to use a billion random numbers per sequence, the
Twister's period has a way-way-beyond astronomical number of
non-overlapping subsequences of that length.  The chance of hitting an
overlap is correspondingly minuscule.

> I devised this function, but I suspect it runs really slowly:

Well, it takes exactly as much time as it takes to call random() n times.

> def trudgeforward(n):
> '''Advance the random generator's state by n calls.'''
> for _ in xrange(n): random.random()
>
> So any speed tips would be very appreciated.

What are you trying to achieve in the end?  Take it as a fact that
there is no known way to advance the Twister by n states faster than
what you've already found.
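
If the underlying goal is just several reproducible, effectively
independent streams, a common approach (a sketch, not something from
this exchange) is to give each consumer its own seeded generator
rather than sharing one and trying to jump around in it:

import random

streams = [random.Random(1000 + i) for i in range(10)]   # seeds arbitrary

# streams[3].random(), streams[3].shuffle(...), etc. always replay the
# same sequence for a given seed; given the Twister's period, the odds
# of two such streams overlapping are negligible.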
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-21 Thread Tim Peters
[MTD <[EMAIL PROTECTED]>]
> I've been testing my recursive function against your iterative
> function, and yours is generally a quite steady 50% faster on
> factorizing 2**n +/- 1 for  0 < n < 60.

If you're still not skipping multiples of 3, that should account for most of it.

> I think that, for a challenge, I'll try to make a recursive function that
> matches or beats the iterative function -- it's worth the experiment!

Don't stop on that one before you succeed ;-)  Provided you make no
more than one recursive call per factor, you can't make more than
log(n, 2) recursive calls in total, and _typically_ far fewer than
that.  IOW, the number of recursive calls should generally be trivial,
and factoring 2**n will be the worst case.
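
For what it's worth, one shape such a function could take -- a sketch
only, passing the current trial divisor down so no divisor is ever
retried and each factor costs at most one recursive call:

def factorize_rec(n, start=2):
    # Resume trial division at `start`; recursion depth is at most the
    # number of prime factors of n, so roughly log(n, 2) in the worst case.
    if n <= 1:
        return []
    d = start
    while d * d <= n:
        if n % d == 0:
            return [d] + factorize_rec(n // d, d)
        d += 1
    return [n]          # no divisor <= sqrt(n), so n is prime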
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Queue.Queue.queue.clear() thread-safe?

2006-06-22 Thread Tim Peters
[Russell Warren]
> I'm guessing no, since it skips down through any Lock semantics,

Good guess :-)  It's also unsafe because some internal conditions must
be notified whenever the queue becomes empty (else you risk deadlock).

> but I'm wondering what the best way to clear a Queue is then.
>
> Esentially I want to do a "get all" and ignore what pops out, but I
> don't want to loop through a .get until empty because that could
> potentially end up racing another thread that is more-or-less blindly
> filling it asynchronously.
>
> Worst case I think I can just borrow the locking logic from Queue.get
> and clear the deque inside this logic, but would prefer to not have to
> write wrapper code that uses mechanisms inside the object that might
> change in the future.

There's simply no defined way to do this now.

> Also - I can of course come up with some surrounding architecture to
> avoid this concern altogether, but a thread-safe Queue clear would do
> the trick and be a nice and short path to solution.
>
> If QueueInstance.queue.clear() isn't thread safe... what would be the
> best way to do it?  Also, if not, why is the queue deque not called
> _queue to warn us away from it?

"Consenting adults" -- if you want an operation that isn't supplied
out of the box, and are willing to take the risk of breaking into the
internals, Python isn't going to stop you from doing whatever you
like.  "mutex" isn't named "_mutex" for the same reason.  A

q.mutex.acquire()
try:
   q.queue.clear()
   q.unfinished_tasks = 0
   q.not_full.notify()
   q.all_tasks_done.notifyAll()
finally:
   q.mutex.release()

sequence probably works (caveat emptor).  BTW, it may be _possible_
you could use the newer task_done()/join() Queue gimmicks in some way
to get what you're after (I'm not really clear on that, exactly).

Anyway, yes, there is cross-release risk in breaking into the
internals like that.  That's an "adult decision" for you to make.  The
only way to get permanent relief is to re-think your approach, or
propose adding a new Queue.clear() method and Queue._clear() default
implementation.  I don't know whether that would be accepted -- it
seems simple enough, although since you're the first person to ask for
it 15 years :-), you won't get much traction arguing that's a critical
lack, and every new bit of cruft becomes an ongoing burden too.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Queue.Queue.queue.clear() thread-safe?

2006-06-22 Thread Tim Peters
[Russell Warren]
>>> I'm guessing no, since it skips down through any Lock semantics,

[Tim Peters]
>> Good guess :-)  It's also unsafe because some internal conditions must
>> be notified whenever the queue becomes empty (else you risk deadlock).

[Fredrik Lundh]
> "also" ?  if it weren't for the other things, the clear() call itself
> would have been atomic enough, right ?

Define "other things" :-)  For example, Queue.get() relies on that
when self._empty() is false, it can do self._get() successfully.
That's 100% reliable now because the self.not_empty condition is in
the acquired state across both operations.  If some yahoo ignores the
underlying mutex and clears the queue after self._empty() returns false
but before get() executes self._get(), the call to the latter will
raise an exception.  That kind of failure is what I considered to be
an instance of a problem due to the OP's "skips down through any Lock
semantics"; the failure to notify internal conditions that the queue
is empty is a different _kind_ of problem, hence my "also" above.

If you're just asking whether deque.clear() is "atomic enough" on its
own, define "enough" :-)  It has to decref each item in the deque, and
that can end up executing arbitrary Python code (due to __del__
methods or weakref callbacks), and that in turn can allow the GIL to
be released allowing all other threads to run, and any of that
Python-level code may even mutate the deque _while_ it's being
cleared.

The C code in deque_clear() looks threadsafe in the sense that it
won't blow up regardless -- and that's about the strongest that can
ever be said for a clear() method.  dict.clear() is actually a bit
stronger, acting as if a snapshot were taken of the dict's state at
the instant  dict.clear() is called.  Any mutations made to the dict
as a result of decref side effects while the original state is getting
cleared survive.  In contrast, if a new item is added to a deque d
(via decref side effect)  while d.clear() is executing, it won't
survive.  That's "atomic enough" for me, but it is kinda fuzzy.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Questions on Threading

2006-06-26 Thread Tim Peters
[j.c.sackett]
> I'm using the threading module to accomplish some distributed processing on
> a project, and have a basic (I hope) question that I can't find an answer to
> elsewhere.
>
> I've noted that there's a lot of documentation saying that there is no
> external way to stop a thread,

True.

> and yet when creating a thread through the threading module I can see
> that it has a method called _Thread__stop, which does indeed stop it.

You're misreading the code then.  __stop() is used internally by a
thread T to record when T itself knows it's _about_ to stop.  Other
threads may be blocked in T.join() calls, and the purpose of T calling
T.__stop() is to notify such threads they can proceed.

> Is this in any way dangerous ( i.e. doesn't actually stop the thread,

That's right.

> or causes other thread timing issues down the road?)

If you call T.__stop(), T will keep running (exactly as if you hadn't
called it), but threads waiting on T.join() will think T has stopped.

> Or is there some other reason that there is officially no stop method?

Python never offered it primarily for the same reasons Java regretted
offering it (so much so that Java eventually deprecated it):

http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html
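
The usual Python answer is cooperative shutdown: the worker checks a
flag (or an Event) at convenient points and returns on its own.  A
minimal sketch, not part of the original exchange:

import threading, time

stop_event = threading.Event()

def worker():
    while not stop_event.isSet():      # spelled is_set() in newer Pythons
        # ... do one bounded chunk of work ...
        time.sleep(0.1)

t = threading.Thread(target=worker)
t.start()
# later, from any thread that wants the worker to quit:
stop_event.set()
t.join()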
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to stop python...

2006-07-02 Thread Tim Peters
[bruce]
> perl has the concept of "die". does python have anything similar. how can a
> python app be stopped?
>
> the docs refer to a sys.stop.

Python docs?  Doubt it ;-)

> but i can't find anything else... am i missing something...

>>> import sys
>>> print sys.exit.__doc__
exit([status])

Exit the interpreter by raising SystemExit(status).
If the status is omitted or None, it defaults to zero (i.e., success).
If the status is numeric, it will be used as the system exit status.
If it is another kind of object, it will be printed and the system
exit status will be one (i.e., failure).



Of course there's nothing to stop you from catching SystemExit either
-- it's just another exception, but one that happens to shut down the
interpreter in the specified way if it isn't caught & handled.
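
For instance (a sketch in Python 2 syntax):

import sys

try:
    sys.exit("giving up")        # raises SystemExit("giving up")
except SystemExit, e:
    print "caught SystemExit, code =", repr(e.code)
# execution carries on here; the interpreter did not shut down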
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Valgrind memory-checker reports memory problems in Python

2006-07-04 Thread Tim Peters
[Nathan Bates]
> Are the Python developers running Python under Valgrind?

Please read Misc/README.valgrind (in your Python distribution).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hash() yields different results for different platforms

2006-07-12 Thread Tim Peters
[Grant Edwards]
>> ...
>> The low 32 bits match, so perhaps you should just use that
>> portion of the returned hash?
>>
>> >>> hex(12416037344)
>> '0x2E40DB1E0L'
>> >>> hex(-468864544 & 0xFFFFFFFFL)
>> '0xE40DB1E0L'
>>
>> >>> hex(12416037344 & 0xFFFFFFFFL)
>> '0xE40DB1E0L'
>> >>> hex(-468864544 & 0xFFFFFFFFL)
>> '0xE40DB1E0L'

[Qiangning Hong]
> Is this relationship (same low 32 bits) guaranteed?

No.  Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.

> Will it change in the future version?

That's possible, but not planned.  Note that the guts of string
hashing in CPython today is implemented via

    while (--len >= 0)
        x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
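
If what's actually needed is a hash that's identical across platforms
and Python versions, the usual trick is to derive it from a
cryptographic digest instead of the built-in hash() -- a sketch
assuming the hashlib module:

import hashlib, struct

def stable_hash32(s):
    # Same 32-bit result everywhere, unlike hash(); s is a byte string.
    return struct.unpack('<I', hashlib.sha1(s).digest()[:4])[0]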
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: random shuffles

2006-07-21 Thread Tim Peters
[Boris Borcic]
> x.sort(cmp = lambda x,y : cmp(random.random(),0.5))
>
> pick a random shuffle of x with uniform distribution ?

Say len(x) == N.  With Python's current sort, the conjecture is true
if and only if N <= 2.

> Intuitively, assuming list.sort() does a minimal number of comparisons to
> achieve the sort,

No idea what that could mean in a rigorous, algorithm-independent way.

> I'd say the answer is yes. But I don't feel quite comfortable
> with the intuition... can anyone think of a more solid argumentation ?

If a list is already sorted, or reverse-sorted, the minimum number of
comparisons needed to determine that is N-1, and Python's current sort
achieves that.  It first looks for the longest ascending or descending
run.  If comparison outcomes are random, it will decide "yes, already
sorted" with probability 1/2**(N-1), and likewise for reverse-sorted.
When N > 2, those are larger than the "correct" probabilities 1/(N!).
So, e.g., when N=3, the sort will leave the list alone 1/4th of the
time (and will reverse the list in-place another 1/4th).  That makes
the identity and reversal permutations much more likely than "they
should be", and the bias is worse as N increases.

Of course random.shuffle(x) is the intended way to effect a
permutation uniformly at random.
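
A quick empirical check of that bias (Python 2 code, since the cmp
argument is gone in Python 3; the trial count is arbitrary):

import random

n_trials = 60000
counts = {}
for i in xrange(n_trials):
    x = [0, 1, 2]
    x.sort(cmp=lambda a, b: cmp(random.random(), 0.5))
    counts[tuple(x)] = counts.get(tuple(x), 0) + 1

for perm in sorted(counts):
    print perm, counts[perm] / float(n_trials)

Per the analysis above, (0, 1, 2) and (2, 1, 0) each show up at least
about a quarter of the time, far more than the 1/6 a uniform shuffle --
i.e., random.shuffle(x) -- would give.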
-- 
http://mail.python.org/mailman/listinfo/python-list

