Re: [Python-ideas] Re: Magnitude and ProtoMagnitude ABCs — primarily for argument validation

2020-03-10 Thread Tim Peters via Python-list
[Marco Sulla ]
> Excuse me, Tim Peters, what do you think about my (probably heretical)
> proposal of simply raising an exception instead of returning a NaN, like
> Python already does for division by zero?

Sorry, I'm missing context.  I don't see any other message(s) from you
in this thread, so don't know what you mean.  The message of mine
you're replying to here said nothing about exceptions or NaNs.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> Sorry, I'm not arguing about this any more.  Pickle doesn't work at
>> all at the level of "count of bytes followed by a string".

[Random832 ]
> The SHORT_BINBYTES opcode consists of the byte b'C', followed by *yes
> indeed* "count of bytes followed by a string".

Yes, some individual opcodes do work that way.


>> If you
>> want to make a pickle argument that makes sense, I'm afraid you'll
>> need to become familiar with how pickle works first.  This is not the
>> place for a pickle tutorial.
>>
>> Start by learning what a datetime pickle actually is.
>> pickletools.dis() will be very helpful.

>  0: \x80 PROTO      3
>  2: c    GLOBAL     'datetime datetime'
> 21: q    BINPUT     0
> 23: C    SHORT_BINBYTES b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
> 35: q    BINPUT     1
> 37: \x85 TUPLE1
> 38: q    BINPUT     2
> 40: R    REDUCE
> 41: q    BINPUT     3
> 43: .    STOP
>
> The payload is ten bytes, and the byte immediately before it is in fact
> 0x0a. If I pickle any byte string under 256 bytes long by itself, the
> byte immediately before the data is the length. This is how I initially
> came to the conclusion that "count of bytes followed by a string" was
> valid.

Ditto.

> I did, before writing my earlier post, look into the high-level aspects
> of how datetime pickle works - it uses __reduce__ to create up to two
> arguments, one of which is a 10-byte string, and the other is the
> tzinfo. Those arguments are passed into the date constructor and
> detected by that constructor - for example, I can call it directly with
> datetime(b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00') and get the same result
> as unpickling.

Good job!  That abuse of the constructor was supposed to remain a secret ;-)
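
For anyone following along at home, a minimal sketch of that trick in
Python 3 (the payload is the one from the disassembly quoted above):

from datetime import datetime

payload = b'\x07\xdf\t\x0e\x15\x06*\x00\x00\x00'
dt = datetime(payload)    # same path the unpickler ends up taking
print(dt)                 # 2015-09-14 21:06:42
print(dt == datetime(2015, 9, 14, 21, 6, 42))   # True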


> At the low level, the part that represents that first argument does
> indeed appear to be "count of bytes followed by a string". I can add to
> the count, add more bytes, and it will call the constructor with the
> longer string. If I use pickletools.dis on my modified value the output
> looks the same except for, as expected, the offsets and the value of the
> argument to the SHORT_BINBYTES opcode.
>
> So, it appears that, as I was saying, "wasted space" would not have been
> an obstacle to having the "payload" accepted by the constructor (and
> produced by __reduce__ ultimately _getstate) consist of "a byte string
> of >= 10 bytes, the first 10 of which are used and the rest of which are
> ignored by python <= 3.5" instead of "a byte string of exactly 10
> bytes", since it would have accepted and produced exactly the same
> pickle values, but been prepared to accept larger arguments pickled from
> future versions.

Yes, if we had done things differently from the start, things would
work differently today.  But what's the point?  We have to live now
with what _was_ done.  A datetime pickle carrying a string payload
with anything other than exactly 10 bytes will almost always blow up
under older Pythons, and would be considered "a bug" if it didn't.
Pickles are not at all intended to be forgiving (they're enough of a
potential security hole without going out of their way to ignore
random mysteries).

It may be nicer if Python had a serialization format more deliberately
designed for evolution of class structure - but it doesn't.  Classes
that need such a thing now typically store their own idea of a
"version" number as part of their pickled state  datetime never did.


> ...
> So have I shown you that I know enough about the pickle format to know
> that permitting a longer string (and ignoring the extra bytes) would
> have had zero impact on the pickle representation of values that did not
> contain a longer string?

Yes.  If we had a time machine, it might even have proved useful ;-)


> I'd already figured out half of this before
> writing my earlier post; I just assumed *you* knew enough that I
> wouldn't have to show my work.

It's always best to show your work on a public list.  Thanks for
finally ;-) doing so!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Random832 ]
> Would allowing a 16-byte string in the future have increased the storage
> occupied by a 10-byte string today? Would allowing a third argument in
> the future have increased the storage occupied by two arguments today?
> As far as I can tell the pickle format for non-primitive types isn't
> _that_ fixed-width.

Sorry, I'm not arguing about this any more.  Pickle doesn't work at
all at the level of "count of bytes followed by a string".  If you
want to make a pickle argument that makes sense, I'm afraid you'll
need to become familiar with how pickle works first.  This is not the
place for a pickle tutorial.

Start by learning what a datetime pickle actually is.
pickletools.dis() will be very helpful.
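
For example, a minimal way to get such a disassembly (assuming Python 3
and pickle protocol 3; other protocols emit different opcodes):

import pickle
import pickletools
from datetime import datetime

pickletools.dis(pickle.dumps(datetime(2015, 9, 14, 21, 6, 42), protocol=3))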


>>> That you didn't makes me wonder just where you're finding the space to
>>> put the fold bit.

>> PEP 495 gives all the details.  Short course:  there are bits that are
>> _always_ 0 now within some datetime pickle bytes.  `fold` will abuse
>> one of those always-0-now pickle bits.

> And what happens to older implementations if that bit is 1?

Unpickling will raise an exception, complaining that the minute value
is out of range.


>> Aware datetimes _within_ a zone also follow the naive time model.
>> It's unfortunate that they're nevertheless called "aware" datetimes.
>>
>> So, sorry, even when sorting a list of aware datetimes, if they share
>> a common zone it is wholly intended that they all work in naive time.

> And if some of them share a common zone, then some of them will work in
> naive time, and some of them will work in aware time, and some pairs
> (well, triples) of them will cause problems for sort/bisect algorithms.

All sorts of things may happen, yes.  As I said, if you need to care,
convert to UTC first.  Most apps do nothing like this.


> Maybe it'd be best to simply ban interzone comparisons.

We cannot.  Backward compatibility.  It would have been better had
interzone comparisons and subtraction not been supported from the
start.  Too late to change that.


> Or have some sort of context manager to determine how arithmetic and
> comparisons work.

Write a PEP ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> Because all versions of Python expect a very specific pickle layout
>> for _every_ kind of pickled object (including datetimes).  Make any
>> change to the pickle format of any object, and older Pythons will
>> simply blow up (raise an exception) when trying to load the new pickle
>> - or do something insane with the pickle bits.  It's impossible for
>> older Pythons to know anything about what "the new bits" are supposed
>> to mean, and there is no way to spell, in the pickle engine, "but if
>> you're an older version, skip over the next N bytes".

[Random832 ]
> Well, you could have put some reserved bits in the original pickle
> format for datetime back when it was first defined, or even just allowed
> passing in a longer string for future extension purposes.

Yes, we "could have" done that for all pickle formats for all types.
But why on Earth would we?  Pickle size is important to many apps
(e.g., Zope applications can store billions of pickles in databases.
and it may not be purely coincidence ;-) that Zope Corp paid for
datetime development), and there would have been loud screaming about
any "wasted" bytes.


> That you didn't makes me wonder just where you're finding the space to put the
> fold bit.

PEP 495 gives all the details.  Short course:  there are bits that are
_always_ 0 now within some datetime pickle bytes.  `fold` will abuse
one of those always-0-now pickle bits.


>> It's not so much a "good idea" as that it's the only idea consistent
>> with Python's "naive time" model.  Folds and gaps don't exist in naive
>> time.  Indeed, the _concept_ of "time zone" doesn't really exist in
>> naive time.  There's _inherent_ tension between the naive time model
>> and the way multi-offset time zones actually behave.  So it goes.

> But why does it need to be consistent? You can't compare naive datetimes
> with aware ones. If you want to sort/bisect a list of datetimes, they
> have to either all be naive or all be aware. So when we're talking about
> how ordering works, we're fundamentally talking about how it works for
> aware datetimes.

Aware datetimes _within_ a zone also follow the naive time model.
It's unfortunate that they're nevertheless called "aware" datetimes.

So, sorry, even when sorting a list of aware datetimes, if they share
a common zone it is wholly intended that they all work in naive time.

Apps that can't tolerate naive time should convert to UTC first.  End
of problems.
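
A sketch of that advice, for apps that do need a strict cross-zone
ordering (timezone.utc here is just the obvious anchor; any fixed-offset
zone would do):

from datetime import datetime, timedelta, timezone

est = timezone(timedelta(hours=-5), "EST")
cet = timezone(timedelta(hours=1), "CET")
dts = [datetime(2015, 9, 14, 12, 0, tzinfo=cet),
       datetime(2015, 9, 14, 8, 0, tzinfo=est)]

# compare instants, not local clocks
dts.sort(key=lambda dt: dt.astimezone(timezone.utc))
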
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Random832 ]
> A) I'm still not sure why, but I was talking about adding an int, not a
> timedelta and a string.
>
> B) Older python versions can't make use of either utcoffset or fold, but
> can ignore either of them. I don't even see why they couldn't ignore a
> timedelta and a string if we felt like adding those.

Because all versions of Python expect a very specific pickle layout
for _every_ kind of pickled object (including datetimes).  Make any
change to the pickle format of any object, and older Pythons will
simply blow up (raise an exception) when trying to load the new pickle
- or do something insane with the pickle bits.  It's impossible for
older Pythons to know anything about what "the new bits" are supposed
to mean, and there is no way to spell, in the pickle engine, "but if
you're an older version, skip over the next N bytes".


> C) What value fold "should" have can be inferred from the time, the
> utcoffset, and the tzinfo.

So you are proposing to add ... something ... _instead_ of adding
`fold`.  Already addressed that.  See above.

> I'm saying that *today*, even with no 495, it [utcoffset] does provide
> *a* value in all cases (even if that's sometimes the "wrong" value
> for an ambiguous time).

Sure.

> And that value is, for any plausible tzinfo, ordered the same for
> any given pair of datetimes with the same tzinfo as the datetimes
> considered as naive datetimes.

Yes.

> There is, or appears to be, a faction that is proposing to change that
> by sorting fold=1 2:15 before fold=0 2:45 even though the former is
> *actually* 30 minutes later than the latter, and I am *utterly baffled*
> at why they think this is a good idea.

It's not so much a "good idea" as that it's the only idea consistent
with Python's "naive time" model.  Folds and gaps don't exist in naive
time.  Indeed, the _concept_ of "time zone" doesn't really exist in
naive time.  There's _inherent_ tension between the naive time model
and the way multi-offset time zones actually behave.  So it goes.
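
To make the tension concrete, a sketch using the `fold` attribute PEP 495
proposes, and the zoneinfo module (which is from Python 3.9, long after
this thread, but implements a hybrid tzinfo in the intended sense):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
first = datetime(2015, 11, 1, 1, 45, fold=0, tzinfo=ny)    # earlier 1:45, EDT
second = datetime(2015, 11, 1, 1, 15, fold=1, tzinfo=ny)   # later 1:15, EST

print(second < first)    # True:  naive (wall-clock) order within the zone
print(second.astimezone(timezone.utc) <
      first.astimezone(timezone.utc))   # False: second is 30 minutes later in UTC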


>> It is true that the earlier and later of an ambiguous time in a fold
>> will compare equal in their own zone, but compare not equal after
>> conversion to UTC (or to any other zone in which they're not in one of
>> the latter zone's folds).  Is that what you're talking about?

> Yes. Or two different ambiguous times, where the properly earlier one
> compares greater and vice versa. I have no idea why anyone thinks this
> is reasonable or desirable behavior.

From which I can guess, without asking, that you think "naive time"
itself is unreasonable and undesirable ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> It depends on how expensive .utcoffset()
>> is, which in turn depends on how the tzinfo author implements it.

[Alex]
> No, it does not.  In most time zones, UTC offset in seconds can be computed
> by C code as a 4-byte integer

Which is a specific implementation of .utcoffset().  Which likely has
nothing to do with how most tzinfo authors will implement _their_
.utcoffset().  For example, look at any tzinfo.utcoffset()
implementation that currently exists ;-)
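
For reference, the kind of pure-Python implementation in question looks
something like this minimal sketch (a fixed-offset toy, not any real zone):

from datetime import datetime, timedelta, tzinfo

class FixedOffset(tzinfo):
    # .utcoffset() costs whatever this Python code costs to run
    def __init__(self, minutes, name):
        self._offset = timedelta(minutes=minutes)
        self._name = name
    def utcoffset(self, dt):
        return self._offset      # a hybrid zone would inspect dt here instead
    def dst(self, dt):
        return timedelta(0)
    def tzname(self, dt):
        return self._name

print(datetime(2015, 9, 14, tzinfo=FixedOffset(-300, "EST")).utcoffset())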


> faster
> than CPython can look up the .utcoffset method. (At least for times
> within a few years around now.) A programmer who makes it slower should
> be fired.

So any programmer who implements .utcoffset() in Python should be
fired?  That's the only way I can read that.


> Yet I agree,  "'premature optimization' applies at this time."

I'm more worried now about premature firing ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Random832 ]

Whether or not datetimes stored tm_gmtoff and tm_zone workalikes has
no effect on semantics I can see.  If, in your view, they're purely an
optimization, they're just a distraction for now.  If you're proposing
to add them _instead_ of adding `fold`, no, that can't work, for the
pickle compatibility reasons already explained.  Whether something is
in a fold needs to be preserved across pickling, but "almost all" pickles
need to be readable by older Pythons too.  This is doable adding one
bit, but not doable at all if we need to add arbitrary timedelta and
string objects _instead_ of that bit.

...

>>>  (No, I don't *care* how that's not how it's defined,

>> ?  How what is defined?:

> Just trying, unsuccessfully apparently, to head off the "no, it's
> defined as working the same as a naive datetime if the tzinfo values are
> the same" argument that got brought up the *last* time I made this
> claim.

Sorry, I still don't know what this is about.


>>> it is *in fact* true for the UTC value that you will ever actually get
>>> from converting the values to UTC *today*, and it's the only total
>>> ordering that actually makes any sense)

>> Well, you lost me there.  In a post-495 world, conversion to UTC will
>> work correctly in all cases.  It cannot today.

> It'll provide *a* value in all cases.

It will provide the correct UTC offset in all cases.


> The sort order today is equivalent to using that value in all
> cases unless you've got a pathological tzinfo
> specifically crafted to break it. I think that's an important enough
> invariant to be worth keeping, since it is the only possible way to
> provide a total order in the presence of interzone comparisons.

Show some code?  I don't know what you're talking about.

It is true that the earlier and later of an ambiguous time in a fold
will compare equal in their own zone, but compare not equal after
conversion to UTC (or to any other zone in which they're not in one of
the latter zone's folds).  Is that what you're talking about?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> It would be nice to have!  .utcoffset() is an expensive operation
>> as-is, and being able to rely on tm_gmtoff would make that dirt-cheap
>> instead.

[Alex]
> If it  is just a question of optimization,

Yes.  If it's more than just that, then 495 doesn't actually solve the
problem of getting the correct UTC offset in all cases.


> datetime objects can be extended to cache utcoffset.  Note that PyPy
> have recently added caching of the hash values in datetime objects.  I
> merged their changes in our datetime.py, but it did not look like C
> implementation would benefit from it as much as pure python did.  I
> expect something similar from caching utcoffset: a measurable
> improvement for tzinfos implemented in Python and a wash for those
> implemented in C.  (A more promising optimization approach is to define a C
> API for tzinfo interface.)

There's no answer to this.  It depends on how expensive .utcoffset()
is, which in turn depends on how the tzinfo author implements it.

I don't care how fast it is.  But, even if I did, "premature
optimization" applies at this time ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> So, on your own machine, whenever daylight time starts or ends, you
>> manually change your TZ environment variable to specify the newly
>> appropriate eternally-fixed-offset zone?  Of course not.

[Random832 ]
> No, but the hybrid zone isn't what gets attached to the individual
> struct tm value when you convert a time from utc (or from a POSIX
> timestamp) to a timezone local value. A single fixed utc offset is
> (along with the name and, yes, isdst flag).

You're assuming much more than POSIX - and the ISO C standard -
requires.  My description was quite explicitly about how POSIX has done
it all along.  tm_gmtoff and tm_zone are extensions to the standards,
introduced (IIRC) by BSD.  Portable code (including Python's
implementation) can't assume they're available.


> And pytz doesn't involve manually changing anything, it involves (as
> best it can) automatically applying the value to attach to each
> individual datetime value.

.normalize() is a manual step.  It doesn't invoke itself by magic
(although I believe Stuart would like Python to add internal hooks so
pytz _could_ get it invoked by magic).


>> A datetime object is the Python spelling of a C struct tm, but never
>> included the tm_isdst flag.

> And no-one behind this proposal seems to be contemplating adding an
> equivalent to tm_gmtoff,

It was off the table because, for backward compatibility, we need to
mess with the pickle format as little as possible.  It's vital that
datetimes obtained from old pickles continue to work fine, and that
pickles obtained from new datetime objects work fine when loaded by
older Pythons unless they actually require the new fold=1 possibility.

> despite that it would serve the same disambiguation purpose and
> make it much cheaper to maintain global invariants like a sort order
> according to the UTC value

It would be nice to have!  .utcoffset() is an expensive operation
as-is, and being able to rely on tm_gmtoff would make that dirt-cheap
instead.


> (No, I don't *care* how that's not how it's defined,

?  How what is defined?:


> it is *in fact* true for the UTC value that you will ever actually get
> from converting the values to UTC *today*, and it's the only total
> ordering that actually makes any sense)

Well, you lost me there.  In a post-495 world, conversion to UTC will
work correctly in all cases.  It cannot today.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Guido]
>> Wouldn't it be sufficient for people in Creighton to set their timezone to
>> US/Central? IIUC the Canadian DST rules are the same as the US ones. Now,
>> the question may remain how do people know what to set their timezone to.
>> But neither pytz nor datetime can help with that -- it is up to the
>> sysadmin.

[Laura Creighton]
> The Canadian DST rules are not the same as the US ones, wherein lies
> the rub.

?   All US locations observing DST switch at 2AM on the local clock on
the 2nd Sunday of March and the 1st Sunday of November (since 2007).
Best I can tell, exactly the same is true of all zones observing DST
in Canada and in Mexico.

>  The province of Saskatchewan has a few areas which prefer to
> be on Mountain time, which we will ignore in this mail.  The bulk of
> them prefer to be on Central time, which is the same as Winnipeg and
> Chicago.  But what happens on DST start day (early March) in
> Saskatchewan?  Provincial law mandates that you do not change your
> clock, and do not adopt daylight savings time.  The people in
> Saskatchewan thus stick to CST all year round.  And this is how
> sites like http://www.timeanddate.com/worldclock/canada/saskatoon
> record things, with them at CST.

I'm sure Guido is only talking about places in Canada that _do_
observe DST.  There are also places in the US that don't observe DST.
Of course "the DST rules" only apply to locations that do observe DST,
and that's what Guido is asking about.


> However, the people of Creighton have decided to keep DST, in
> violation of the law, so they set their clocks forward.
> http://www.timeanddate.com/worldclock/canada/creighton does this
> properly as well, showing them with CDT.
>
> But this is not quite the complete story.  In many (most?) places in
> Saskatchewan, the rule is understood differently.  Instead of 'we keep
> to CST all year long' it is understood that 'we keep central time in
> the winter and mountain time in the summer'.
>
> This makes parsing log files from all over Saskatchewan, where sometime
> in March things often stop saying CST and say MDT instead, rather more
> interesting than the average person might suspect.

Read question as "in general" ;-)  There are plenty of exceptions to
the general rules all over the world.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-14 Thread Tim Peters
[Tim]
>> pytz solves it by _never_ creating a hybrid tzinfo.  It only uses
>> eternally-fixed-offset tzinfos.  For example, for a conceptual zone
>> with two possible total UTC offsets (one for "daylight", one for
>> "standard"), there two distinct eternally-fixed-offset tzinfo objects
>> in pytz.  Then an ambiguous time is resolved by _which_ specific
>> tzinfo object is attached.  Typically the "daylight" tzinfo for the
>> first time a repeated local time appears, and the "standard" tzinfo
>> for its second appearance

[Laura Creighton ]
> Yes.  I think this is a really great idea.  I have no idea why other
> people disagree.

Really?


>> In return, you have to use .localize() and .normalize() at various
>> times, because pytz's tzinfo objects themselves are completely blind
>> to the possibility of the total UTC offset changing. .localize() and
>> .normalize() are needed to possibly _replace_ the tzinfo object in
>> use, depending on the then-current date and time.

> Yes.

>>OTOH, `dateutil` does create hybrid tzinfo objects.  No dances are
>>ever needed to possibly replace them.  But it's impossible for
>>dateutil's tzinfos to disambiguate times in a fold.  Incidentally,
>>dateutil also makes no attempt to account for transitions other than
>>DST (e.g., sometimes a zone may change its _base_ ("standard") offset
>>from UTC).

> I find this totally unacceptable.

The precise referent for "this" isn't clear to me.  The lack of
knowledge of base-offset transitions is a property of dateutil's
implementation, due to inheriting the default .fromutc() instead of
implementing its own to take advantage of all the transition info a
tzfile records.  It's not _inherent_ to hybrid tzinfos.  Just an
omission in this specific zoneinfo wrapping.


> My conclusion was that hybrid tzinfo
> objects were a _really stupid idea_ proposed by somebody who misunderstood
> the problem, or rather only understood the most common case.  Smack them
> with a dead fish,  https://www.youtube.com/watch?v=i9SSOWORzw4
> and get back to work.

So, on your own machine, whenever daylight time starts or ends, you
manually change your TZ environment variable to specify the newly
appropriate eternally-fixed-offset zone?  Of course not.  Your OS
doesn't change it for you by magic either.  Your OS implements a
hybrid tzinfo:   it knows all by itself what the appropriate current
total UTC offset is, by consulting the tzfile (or hybrid POSIX TZ
rule) you attached to your account the last time you set it, however
many years ago that may have been.  Indeed, I don't know of any
software package _other_ than pytz doing it the "switch
eternally-fixed-offset zones" way.  Not that it's impossible Stuart is
the first person in history not to implement it in a "really stupid"
way ;-)  But it was more that Stuart made heroic efforts to leap the
datetime design gap explained next:

In all POSIX-ish OSes, hybrid tzinfos are just business as usual:  a
struct tm's tm_isdst flag distinguishes ambiguous times.  localtime()
(UTC->local) sets tm_isdst by magic for you, and mktime() (local->UTC)
consults tm_isdst to disambiguate local times in a fold.

A datetime object is the Python spelling of a C struct tm, but never
included the tm_isdst flag.  PEP 495 aims to supply a similar bit.
When it does, hybrid tzinfos will do cross-zone conversions correctly
in all cases.  Before then, pytz uses eternally-fixed-offset zones
primarily to overcome the current lack of that bit.

C's localtime() and mktime() are both spelled .astimezone(...) in
Python, but under the covers localtime() is essentially Python's
fromutc() (which will set the new bit appropriately after 495) and
mktime() is essentially .utcoffset() (which will consult the new bit
after 495).

That's all there is to it.  "In theory" ;-)  datetime will become
more-or-less exactly as stupid as POSIX has been for some decades.
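
A sketch of how that plays out post-495, using zoneinfo (Python 3.9, far
newer than this thread, but a hybrid tzinfo in the intended sense):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")

# "localtime()":  UTC -> local via fromutc(), which sets fold for the repeated hour
a = datetime(2015, 11, 1, 5, 30, tzinfo=timezone.utc).astimezone(eastern)
b = datetime(2015, 11, 1, 6, 30, tzinfo=timezone.utc).astimezone(eastern)
print(a.hour, a.minute, a.fold)   # 1 30 0  -- first 1:30, EDT
print(b.hour, b.minute, b.fold)   # 1 30 1  -- second 1:30, EST

# "mktime()":  local -> UTC via utcoffset(), which consults fold to disambiguate
print(a.astimezone(timezone.utc).hour, b.astimezone(timezone.utc).hour)   # 5 6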


>> So, yup, if you're thoroughly indoctrinated in pytz behavior, you will
>> be accurate but appear insane to Guido ;-)  At a semantic level, a
>> pytz tzinfo doesn't capture the notion of a zone with offset changes -
>> it doesn't even try to.  All knowledge about offset changes is inside
>> the .localize() and .normalize() dances.

> I can see why people would like to modify it to spit out this information
> when asked.  I don't understand why they would like to have a hybrid
> tzinfo.  The notion of replacing tzinfos when they become inappropriate
> fills their souls with revulsion, or something?

Because, e.g., a hybrid tzinfo (after 495) _can_ do it correctly all
the time, with no need for user interventions ever.  Same as your OS
does.  That's about usability and lack of surprise.  It didn't take
much web cruising, e.g., to find baffled new pytz users wondering why
the zone displayed "was wrong" sometimes in utterly unexceptional
cases (nowhere near a fold or a gap).  "Read the docs!  normalize()!"
Why should they _have_ to?  The only other things in their life that
demand they manual

Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-13 Thread Tim Peters
[Guido]
> Wouldn't it be sufficient for people in Creighton to set their timezone to
> US/Central? IIUC the Canadian DST rules are the same as the US ones. Now,
> the question may remain how do people know what to set their timezone to.
> But neither pytz nor datetime can help with that -- it is up to the
> sysadmin.

As Laura's use case evolved, it seems it was more that a train
traveler from Halifax to Creighton wants to tell their Halifax
relatives when they'll arrive in Creighton, but (of course) expressed
in Halifax time.  Nobody in this case knows anything about Creighton's
rules, except the traveler may be staring at a train schedule giving
arrival in Creighton time anyway.

While this may be beyond pytz's wizardry, nothing is too hard for datetime ;-)

datetime.timezone.setcontext("datetime-sig messages from mid-Sep 2015")
arrivaltime = datetime.strptime(scraped_arrival_time, "")
arrivaltime = datetime.replace(arrivaltime,
tzinfo=gettz("Context/Creighton"))
print(arrivaltime.astimezone(gettz("Context/Halifax")))
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-13 Thread Tim Peters
[Tim]
>> Whatever time zone the traveler's railroad schedule uses, so long as
>> it sticks to just one

[Laura]
> This is what does not happen.  Which is why I have written a python
> app to perform conversions for my parents, in the past.

So how did they get the right time zone rules for Creighton?


>>But there's nothing new here:  datetime has been around for a dozen
>>years already, and nobody is proposing to add any new basic
>>functionality to tzinfos.  PEP 495 is only about adding a flag to
>>allow correct conversion of ambiguous local times (typically at the
>>end of DST, when the local clock repeats a span of times) to UTC.  So
>>if this were a popular use case, I expect we would already have heard
>>of it.  Note that Python zoneinfo wrappings are already available via,
>>at least, the pytz and dateutil packages.

> I am a happy user of pytz.  On the other hand, I think this means that
> my brain has gone through some sort of non-reversible transformation
> which makes me accurate, but not exactly sane on the issue.

pytz made some strange decisions, from the POV of datetime's intended
tzinfo design.  But it also solved a problem datetime left hanging:
how to disambiguate ambiguous local times.

The _intended_ way to model zones with UTC offset transitions was via
what the docs call a "hybrid" tzinfo:  a single object smart enough on
its own to figure out, e.g., whether a datetime's date and time are in
"daylight" or "standard" time.  However, there's currently no way for
such a tzinfo to know whether an ambiguous local time is intended to
be the earlier or the later of repeated times.  PEP 495 aims to plug
that hole.

pytz solves it by _never_ creating a hybrid tzinfo.  It only uses
eternally-fixed-offset tzinfos.  For example, for a conceptual zone
with two possible total UTC offsets (one for "daylight", one for
"standard"), there two distinct eternally-fixed-offset tzinfo objects
in pytz.  Then an ambiguous time is resolved by _which_ specific
tzinfo object is attached.  Typically the "daylight" tzinfo for the
first time a repeated local time appears, and the "standard" tzinfo
for its second appearance.

In return, you have to use .localize() and .normalize() at various
times, because pytz's tzinfo objects themselves are completely blind
to the possibility of the total UTC offset changing. .localize() and
.normalize() are needed to possibly _replace_ the tzinfo object in
use, depending on the then-current date and time.

OTOH, `dateutil` does create hybrid tzinfo objects.  No dances are
ever needed to possibly replace them.  But it's impossible for
dateutil's tzinfos to disambiguate times in a fold.  Incidentally,
dateutil also makes no attempt to account for transitions other than
DST (e.g., sometimes a zone may change its _base_ ("standard") offset
from UTC).

So, yup, if you're thoroughly indoctrinated in pytz behavior, you will
be accurate but appear insane to Guido ;-)  At a semantic level, a
pytz tzinfo doesn't capture the notion of a zone with offset changes -
it doesn't even try to.  All knowledge about offset changes is inside
the .localize() and .normalize() dances.
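
In code, the dance looks roughly like this (assuming pytz is installed;
the zone and times are only examples):

from datetime import datetime, timedelta
import pytz

eastern = pytz.timezone("US/Eastern")

# localize() picks one of the fixed-offset tzinfos from the wall time (and
# is_dst, to break ties in a fold)
dt = eastern.localize(datetime(2015, 11, 1, 1, 30), is_dst=True)   # earlier 1:30, EDT

later = dt + timedelta(hours=1)      # arithmetic leaves the old fixed offset attached...
later = eastern.normalize(later)     # ...so swap in the now-appropriate tzinfo
print(later.tzname())                # EST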


> I think I have misunderstood Alexander Belopolsky as saying that
> datetime had functionality which I don't think it has. Thus I thought
> we must be planning to add some functionality here.  Sorry about this.

Guido told Alex to stop saying that ;-)  You can already get
eternally-fixed-offset classes, like pytz does, on (at least) Linux
systems by setting os.environ['TZ'] and then exploiting that
.astimezone() without an argument magically synthesizes an
eternally-fixed-offset tzinfo for "the system zone" (which the TZ
envar specifies) current total UTC offset.  That's not really
comparable to what pytz does, except at a level that makes a lot of
sense in theory but not much at all in practice ;-)
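
A sketch of that trick, for the curious (POSIX-ish systems only;
time.tzset() doesn't exist on Windows):

import os
import time
from datetime import datetime, timezone

os.environ["TZ"] = "America/Winnipeg"
time.tzset()    # per Alex, often not even needed

# .astimezone() with no argument synthesizes a fixed-offset tzinfo for the
# system zone's current total UTC offset
now = datetime.now(timezone.utc).astimezone()
print(now.tzinfo, now.utcoffset())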


> However, people do need to be aware, if they are not already, that
> people with 3 times in 3 different tz will want to sort them.  Telling
> them that they must convert them to UTC before they do so is, in my
> opinion, a very fine idea. Expecting them to work this out by themselves
> via a assertion that the comparison operator is not transitive, is,
> I think, asking a lot of them.

Of course.  Note that it's _not_ a problem in pytz, though:  there are
no sorting (or transitivity) problems if the only tzinfos you ever use
have eternally fixed UTC offsets.  There are no gaps or folds then,
and everything works in an utterly obvious way - except that you have
to keep _replacing_  tzinfos when they become inappropriate for the
current dates and times in the datetimes they're attached to.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-13 Thread Tim Peters
[Laura]
>>> But I am not sure how it is that a poor soul who just wants to print a
>>> railway schedule 'in local time' is supposed to know that Creighton is
>>> using Winnipeg time.

[Tim]
>> I'm not sure how that poor soul would get a railway schedule
>> manipulable in Python to begin with ;-)

[Laura]
> Via Rail will give you a schedule when you book your tickets.  But I
> am wrong, it gives it to you in local time, which you can scrape or
> even use the via rail api.  So it is the person getting off in
> Creighton who wants to tell his relatives back in Halifax what
> time he is arriving (in their time) (so they can call him and
> avoid the hellish hotel surtax on long distance calls) who will
> have the problem.

Whatever time zone the traveler's railroad schedule uses, so long as
it sticks to just one, the traveler subtracts the departure time from
the arrival time to determine how long the trip takes.  They add that
to the Halifax time at which they depart, and tell their Halifax
relatives the result.  They don't need to know anything about the
destination's time zone to do this, unless a daylight transition
occurs between departure and arrival, and the schedule itself
remembered to account for it.  In which case, pragmatically, they can
just add an hour "to be safe" ;-)


> And this is the sort of use case I think we will see a lot of.

But there's nothing new here:  datetime has been around for a dozen
years already, and nobody is proposing to add any new basic
functionality to tzinfos.  PEP 495 is only about adding a flag to
allow correct conversion of ambiguous local times (typically at the
end of DST, when the local clock repeats a span of times) to UTC.  So
if this were a popular use case, I expect we would already have heard
of it.  Note that Python zoneinfo wrappings are already available via,
at least, the pytz and dateutil packages.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-13 Thread Tim Peters
[Tim]
>> Hi, Laura!  By "zoneinfo" here, we mean the IANA (aka "Olson") time
>> zone database, which is ubiquitous on (at least) Linux:
>>
>>https://www.iana.org/time-zones
>>
>>So "will a wrapping of zoneinfo handle XYZ?" isn't so much a question
>>about the wrapping as about what's in the IANA database.

[Laura]
> Then we had better be able to override it when it is wrong.

Anyone can write their own tzinfo implementing any rules they like,
and nobody is required to use anyone else's tzinfos.

That said, zoneinfo is the most extensive collection of time zone info
there is, so most people will probably use only that.

And that said, zoneinfo is inordinately concerned with recording
highly dubious guesses about what "the rules" were even over a century
ago.  Most people would probably be happy with a tzinfo that _only_
knew what "today's rules" are.  POSIX TZ rules give a simple way to
spell exactly that.  Simple, but annoyingly cryptic.  Gustavo's
`dateutil` already supplies a way to magically build a tzinfo
implementing a zone specified by a POSIX TZ rule string.  More obvious
ways to spell that are surely possible (like, for example, the obvious
ways).

Patches welcome ;-)


>> Best guess is that Creighton's rules are covered by that database's
>> America/Winnipeg entries.
>>
>> # Saskatchewan
>> # Other western towns (e.g. Lloydminster) are like Edmonton.
>> # Matthews and Vincent (1998) write that Denare Beach and Creighton
>> # are like Winnipeg, in violation of Saskatchewan law.

> I think that this will work.
> Creighton is just across the border from Flin Flon, Manitoba.  Indeed I think
> the problem of 'drunken people from Manitoba trying to get one hours more
> drinking done and being a menace on the highway' may have fueled the
> 'we are going to have DST in violation of the law' movement in Creighton.

:-)


> But I am not sure how it is that a poor soul who just wants to print a
> railway schedule 'in local time' is supposed to know that Creighton is
> using Winnipeg time.

I'm not sure how that poor soul would get a railway schedule
manipulable in Python to begin with ;-)

If it's "a problem" for "enough" users of a computer system, a Linux
admin could simply make "America/Creighton" a link to the
"America/Winnipeg" tzfile.  But doing that for every nameable place on
Earth might be considered annoying.

To cover everyone, you may even need to specify a street address
within "a city":

http://www.quora.com/Are-there-any-major-cities-divided-by-two-time-zones

Blame politicians for this.  I can assure you Guido is not responsible
for creating this mess ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-13 Thread Tim Peters
[Alex]
>>I will try to create a  zoneinfo wrapping prototype as well, but I will
>>probably "cheat" and build it on top of pytz.

[Laura Creighton]
> My question, is whether it will handle Creighton, Saskatchewan, Canada?
> Creighton is an odd little place.  Like all of Saskatchewan, it is
> in the Central time zone, even though you would expect it to be
> in the Mountain time zone based on its location on the globe.
> The people of Saskatchewan have decided not to adopt Daylight
> Savings time.  Except for the people of Creighton (and
> nearby Denare Beach) -- who _do_ observe Daylight savings time.
>
> makes for an interesting corner case, one that I remember for
> personal (and not economic, or professional) reasons.

Hi, Laura!  By "zoneinfo" here, we mean the IANA (aka "Olson") time
zone database, which is ubiquitous on (at least) Linux:

https://www.iana.org/time-zones

So "will a wrapping of zoneinfo handle XYZ?" isn't so much a question
about the wrapping as about what's in the IANA database.

Best guess is that Creighton's rules are covered by that database's
America/Winnipeg entries.  It's generally true that the database makes
no attempt to name every location on the planet.  Instead it uses
names of the form "general/specific" where "general" limits the scope
to some large area of the Earth (here "America" really means "North
America"), and "specific" names a well-known city within that area.

For example, I live in Ashland, Wisconsin (extreme far north in that
state, on Lake Superior), but so far as IANA is concerned my time zone
rules are called "America/Chicago" (some 460 air miles SSE, in a
different state).

Just for fun, I'll paste in the comments from the Saskatchewan section
of IANA's "northamerica" data file (a plain text source file from
which binary tzfiles like America/Chicago and America/Winnipeg are
generated).  You'll see Creighton mentioned if you stay alert ;-)

# Saskatchewan

# From Mark Brader (2003-07-26):
# The first actual adoption of DST in Canada was at the municipal
# level.  As the [Toronto] Star put it (1912-06-07), "While people
# elsewhere have long been talking of legislation to save daylight,
# the city of Moose Jaw [Saskatchewan] has acted on its own hook."
# DST in Moose Jaw began on Saturday, 1912-06-01 (no time mentioned:
# presumably late evening, as below), and would run until "the end of
# the summer".  The discrepancy between municipal time and railroad
# time was noted.

# From Paul Eggert (2003-07-27):
# Willett (1914-03) notes that DST "has been in operation ... in the
# City of Moose Jaw, Saskatchewan, for one year."

# From Paul Eggert (2006-03-22):
# Shanks & Pottenger say that since 1970 this region has mostly been as Regina.
# Some western towns (e.g. Swift Current) switched from MST/MDT to CST in 1972.
# Other western towns (e.g. Lloydminster) are like Edmonton.
# Matthews and Vincent (1998) write that Denare Beach and Creighton
# are like Winnipeg, in violation of Saskatchewan law.

# From W. Jones (1992-11-06):
# The. . .below is based on information I got from our law library, the
# provincial archives, and the provincial Community Services department.
# A precise history would require digging through newspaper archives, and
# since you didn't say what you wanted, I didn't bother.
#
# Saskatchewan is split by a time zone meridian (105W) and over the years
# the boundary became pretty ragged as communities near it reevaluated
# their affiliations in one direction or the other.  In 1965 a provincial
# referendum favoured legislating common time practices.
#
# On 15 April 1966 the Time Act (c. T-14, Revised Statutes of
# Saskatchewan 1978) was proclaimed, and established that the eastern
# part of Saskatchewan would use CST year round, that districts in
# northwest Saskatchewan would by default follow CST but could opt to
# follow Mountain Time rules (thus 1 hour difference in the winter and
# zero in the summer), and that districts in southwest Saskatchewan would
# by default follow MT but could opt to follow CST.
#
# It took a few years for the dust to settle (I know one story of a town
# on one time zone having its school in another, such that a mom had to
# serve her family lunch in two shifts), but presently it seems that only
# a few towns on the border with Alberta (e.g. Lloydminster) follow MT
# rules any more; all other districts appear to have used CST year round
# since sometime in the 1960s.

# From Chris Walton (2006-06-26):
# The Saskatchewan time act which was last updated in 1996 is about 30 pages
# long and rather painful to read.
# http://www.qp.gov.sk.ca/documents/English/Statutes/Statutes/T14.pdf
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
[Alex]
>>> I will try to create a  zoneinfo wrapping prototype as well, but I will
>>> probably "cheat" and build it on top of pytz.

[Tim]
>> It would be crazy not to ;-)  Note that Stuart got to punt on "the
>> hard part":  .utcoffset(), since pytz only uses fixed-offset classes.
>> For a prototype - and possibly forever after - I'd be inclined to
>> create an exhaustive list of transition times in local time, parallel
>> to the list of such times already there in UTC.

[Alex]
> Yes.  The only complication is that you need four transition points instead
> of two per year in a regular DST case: (1) start of gap; (2) end of gap; (3)
> start of fold; and (4) end of fold.  Once you know where you are with
> respect to those points, figuring out utcoffset(), dst() and tzname() for
> either value of fold is trivial.

I wouldn't call those extras transitions - they're just warts hanging
off of actual transitions.  Earlier I showed Stuart how to determine
everything about a possible fold from a UTC time using pytz's internal
info, in

PEP-431/495
Fri, 28 Aug 2015 01:01:06 -0500

He didn't reply that I saw, so it was either obvious or
incomprehensible to him ;-)  In any case, it just takes some very
simple code once the transition record the UTC time belongs in is
found.  I'd be surprised if it weren't similarly easy to determine
everything about a possible gap.

At least in a zoneinfo wrapping, a hybrid tzinfo's .utcoffset() has to
(at least internally) find "the transition record the UTC time belongs
in" regardless.


> ...
> It's a shame though to work from a transitions in UTC list

But that's what tzfiles store.  It would be insane for a zoneinfo
wrapping not to take advantage of that.  For which reason, I do
consider dateutil's zoneinfo wrapping to be insane ;-)  (It inherits
the default .fromutc())

Ah, BTW, I think dateutil's zoneinfo wrapping also misunderstood
some of what's actually in a tzfile.  Specifically, a tzfile's
"UTC/local indicators" and "standard/wall indicators" are 100% useless
for anything we need, and shouldn't even be read from the file(*)
(seek over 'em).


> because most of DST rules are expressed in local times and then
> laboriously converted into UTC.

It's just a few lines of code in zoneinfo's zic.c.  Nobody is doing it
"by hand" there.

> I think I should also implement the POSIX TZ spec tzinfo.

For that you really should grab dateutil.  It has a full
implementation of POSIX TZ rules, as hybrid tzinfos; here from its
docs:

>>> tz1 = tzstr('EST+05EDT,M4.1.0,M10.5.0')
>>> tz2 = tzstr('AEST-10AEDT-11,M10.5.0,M3.5.0')
>>> dt = datetime(2003, 5, 8, 2, 7, 36, tzinfo=tz1)
>>> dt.strftime('%X %x %Z')
'02:07:36 05/08/03 EDT'
>>> dt.astimezone(tz2).strftime('%X %x %Z')
'16:07:36 05/08/03 AEST'

Of course this implementation is tied into dateutil's rich supply of
"calendar operations" too.


> This is where the advantage of the "as intended" approach will be obvious.

?  "As intended" is all about (to me) using hybrid tzinfos.  And those
are far richer in tzfiles than from POSIX rules.  The latter only
present us with simplest-possible DST transitions; tzfiles present us
with every absurd political manipulation yet inflicted on humankind
;-)

---
(*) Long boring story.  Short course:  those indicators are only
needed, on some systems, if a POSIX TZ rule specifies a zone offset
but gives no rules at all for daylight transitions, _and_ the system
has a "posixrules" tzfile.  Then an insane scheme is used to make up
daylight rules "as if" the source file from which the posixrules
tzfile was created had been for a zone with the TZ-specified standard
offset instead, and these absurd indicators are used to figure out
whether the posixrules source file specified _its_ daylight rules
using UTC or local times, and if the latter case, then whether using
standard time or wall-clock time instead.  It's completely nuts.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
[Tim]
>> Me too - except I think acceptance of 495 should be contingent upon
>> someone first completing a fully functional (if not releasable)
>> fold-aware zoneinfo wrapping.

[Alex]
> Good idea.  How far are you from completing that?

In my head, it was done last week ;-)  In real life, I'm running out
of spare time for much of anything anymore.  I don't expect to be able
to resume zoneinfo fiddling for at least 2 weeks.


>>  Details have a way of surprising, and
>> we should learn from the last time we released a tzinfo spec in the
>> absence of any industrial-strength wrappings using it.

> I completely agree.  That's why I am adding test cases like Lord Howe Island
> and Vilnius to datetimetester.

That helps a lot, but "industrial-strength" implies "by algorithm".
There are far too many zones to deal with by crafting a hand-written
class for each.

> I will try to create a  zoneinfo wrapping prototype as well, but I will
> probably "cheat" and build it on top of pytz.

It would be crazy not to ;-)  Note that Stuart got to punt on "the
hard part":  .utcoffset(), since pytz only uses fixed-offset classes.
For a prototype - and possibly forever after - I'd be inclined to
create an exhaustive list of transition times in local time, parallel
to the list of such times already there in UTC.  An index into either
list then gives an index into the other, and into the list of
information about the transition (total offset, is_dst, etc).
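
A sketch of what the parallel local-time bookkeeping buys (made-up data
again; a real wrapping would generate this from the tzfile):

from datetime import datetime, timedelta

# one hypothetical fall-back transition: clocks go from 02:00 back to 01:00
trans_utc = datetime(2015, 11, 1, 7)     # instant of the change, in UTC
off_before = timedelta(hours=-5)         # "daylight" total offset
off_after = timedelta(hours=-6)          # "standard" total offset

# the same transition on the local clock:  wall times in
# [trans_utc + off_after, trans_utc + off_before) occur twice
fold_start = trans_utc + off_after       # 01:00 local
fold_end = trans_utc + off_before        # 02:00 local

def is_ambiguous(local_dt):
    return fold_start <= local_dt < fold_end

print(is_ambiguous(datetime(2015, 11, 1, 1, 30)))   # True
print(is_ambiguous(datetime(2015, 11, 1, 2, 30)))   # False
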
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
[Guido]
>> Those pytz methods work for any (pytz) timezone -- astimezone() with a
>> default argument only works for the local time zone.

[Alex]
> That's what os.environ['TZ'] = zonename is for.  The  astimezone() method
> works for every timezone installed on your system.  Try it - you won't even
> need to call time.tzset()!

I tried it.  It makes no difference to anything for me.  I stay on
Windows to remind people that millions of Python users don't see any
of the horrid nonsense Linuxish systems force on poor users ;-)


> ...
> In any case, there are three approaches to designing a TZ database interface
> in the datetime module: the "as intended" approach, the pytz approach and
> the astimezone(zonename:str) approach.

Portability rules out #3, unless Python bundles its own zoneinfo wrapping.

pytk's approach has many attractions, like no need for `fold` and no
breakage of anything, and blazing fast .utcoffset().   Except at least
 arithmetic would have to be patched to do a
`normalize` variant by magic (to attach the now-appropriate
fixed-offset tzinfo, but without changing the clock in the process).
Alas, that would be a huge speed hit for classic arithmetic.

So, as always, the original intent is the only one that makes sense in
the end ;-)

> ...
> That's why I believe PEP 495 followed by the implementation
> of fold-aware "as intended" tzinfos (either within stdlib or by third
> parties) is the right approach.

Me too - except I think acceptance of 495 should be contingent upon
someone first completing a fully functional (if not releasable)
fold-aware zoneinfo wrapping.  Details have a way of surprising, and
we should learn from the last time we released a tzinfo spec in the
absence of any industrial-strength wrappings using it.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
>>> If there are not, maybe the intended semantics should go
>>> by the wayside and be replaced by what pytz does.

>> Changing anything about default arithmetic behavior is not a
>> possibility.  This has been beaten to death multiple times on this
>> mailing list already, and I'm not volunteering for another round of it
>> ;-)


[Alex]
> Tim and Guido only grudgingly accept it, but datetime already gives you "the
> pytz way" and PEP 495 makes a small improvement to it.

To be clear, "Tim and Guido" have nothing at all against timeline
arithmetic.  Sometimes it's exactly what you need.  But the _intended_
way to get it was always to convert to UTC first, or to just use plain
old timestamps.  Classic arithmetic was very intentionally the
default.

The only "grudgingly accepted" part is that .astimezone() grew a
special case later, to make the absence of an argument "mean
something":


> The localize/normalize functionality is provided by the .astimezone()
> method which when called without arguments will attach an appropriate
> fixed offset timezone to a datetime object.   You can then add timedeltas
> to the result and stay within a "fictitious" fixed offset timezone that
> extends indefinitely in both directions.  To get back to the actual civil time - you
> call .astimezone() again.  This gives you what we call here a "timeline"
> arithmetic and occasionally it is preferable to doing arithmetic in UTC.
> (Effectively you do arithmetic in local standard time instead of UTC.)
> Using a fixed offset timezone other than UTC for timeline arithmetic is
> preferable in timezones that are far enough from UTC that business hours
> straddle UTC midnight.

The distance from UTC can't make any difference to the end result,
although if you're working in an interactive shell "it's nice" to see
intermediate results near current wall-clock time.

"A potential problem" with .astimezone()'s default is that it _does_
create a fixed-offset zone.  It's not at all obvious that it should do
so.  First time I saw it, my initial _expectation_ was that it
"obviously" created a hybrid tzinfo reflecting the system zone's
actual daylight rules, as various "tzlocal" implementations outside of
Python do.
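
For concreteness, a sketch of that timeline-arithmetic dance (zoneinfo is
from Python 3.9, far newer than this thread; any hybrid tzinfo would do):

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")
dt = datetime(2015, 10, 24, 12, 0, tzinfo=paris)   # the day before Paris falls back

pinned = dt.astimezone()           # fixed-offset tzinfo for that instant (system zone)
pinned += timedelta(days=1)        # timeline arithmetic: exactly 24 real hours
print(pinned.astimezone(paris))    # 2015-10-25 11:00:00+01:00 -- back to civil time
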
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
[]
> My context is that I am working on an idea to include utc offsets in
> datetime objects (or on a similar object in a new module), as an
> alternative to something like a "fold" attribute. and since "classic
> arithmetic" is apparently so important,

Love it or hate it, it's flatly impossible to change anything about it
now, for backward compatibility.


> I'm trying to figure out how
> "classic arithmetic" _is actually supposed to work_ when adding a
> timedelta to a time lands it on the opposite side of a transition (or in
> the middle of a "spring forward" gap).

datetime arithmetic is defined in the Python docs.


> If there is a "fall back" transition tonight, then adding a day to a
> time of 12 noon today could end up as:
>
> 12 noon tomorrow, offset still DST.
> 12 noon tomorrow, offset in standard time, 25 hours from now in real
> time.
> 11 AM tomorrow, offset in standard time, 24 hours from now in real time
>
> Which one of these is "classic arithmetic"?

12 noon tomorrow in every case, regardless of tzinfo and regardless of
whether any kind of transition may or may not have occurred.  Whether
it is or isn't in DST in this specific case isn't defined by Python -
that's entirely up to what the tzinfo implementation says.  The
_intended_ way of implementing tzinfos would say it was in standard
time.
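
A sketch of that answer with a hybrid tzinfo (zoneinfo, from Python 3.9,
implements the intended hybrid model):

from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

central = ZoneInfo("America/Chicago")
noon = datetime(2015, 10, 31, 12, 0, tzinfo=central)   # the day before fall-back
tomorrow = noon + timedelta(days=1)                    # classic arithmetic

print(tomorrow)              # 2015-11-01 12:00:00-06:00 -- 12 noon on the local clock
print(tomorrow - noon)       # 1 day, 0:00:00 (same tzinfo: naive subtraction)
print(tomorrow.astimezone(timezone.utc)
      - noon.astimezone(timezone.utc))   # 1 day, 1:00:00 of real time elapsed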


> Pytz (if you don't
> explicitly call a "normalize" function) results in something that looks
> like the first.

Yes, because pytz always uses a fixed-offset tzinfo.  There is no
difference between timeline arithmetic and classic arithmetic in any
fixed-offset zone.


> In one of the models I've thought of, you can get the
> second by replacing the tzinfo again, or the third by doing astimezone,
> but the first preserves "exactly 24 hours in the future" in both the UTC
> moment and the naive interpretation by leaving the offset alone even if
> it is an "unnatural" offset.
>
> The second one above is what you get when you call normalize.

Yes.  .normalize() effectively converts to UTC and back again.  In
fact, this is all it does:

def normalize(self, dt, is_dst=False):
    if dt.tzinfo is self:
        return dt
    if dt.tzinfo is None:
        raise ValueError('Naive time - no tzinfo set')
    return dt.astimezone(self)

.fromutc() is called as the last step of .astimezone(), and pytz
overrides the default .fromutc() to plug "the appropriate"
fixed-offset pytz tzinfo into the result.


> My question was whether there are any real implementations that work the
> intended way.

dateutil, plus all implementations anyone may have written for
themselves based on the Python doc examples.  When datetime was
originally released, there were no concrete tzinfo implementations in
the world, so lots of people wrote their own for the zones they needed
by copy/paste/edit of the doc examples.


> If there are not, maybe the intended semantics should go
> by the wayside and be replaced by what pytz does.

Changing anything about default arithmetic behavior is not a
possibility.  This has been beaten to death multiple times on this
mailing list already, and I'm not volunteering for another round of it
;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: [Datetime-SIG] Are there any "correct" implementations of tzinfo?

2015-09-12 Thread Tim Peters
> I was trying to find out how arithmetic on aware datetimes is "supposed
> to" work, and tested with pytz. When I posted asking why it behaves this
> way I was told that pytz doesn't behave correctly according to the way
> the API was designed.

You were told (by me) that its implementation of tzinfos was not the
_intended_ way.  Which is another way of saying it was an
unanticipated way.  "Correctly" is a whole different kind of judgment.
pytz users who faithfully follow the docs seem happy with it.


> The tzlocal module, on the other hand, appears to
> simply defer to pytz on Unix systems.
>
> My question is, _are_ there any correct reference implementations that
> demonstrate the proper behavior in the presence of a timezone that has
> daylight saving time transitions?

Which specific "proper behaviors"?  :"Hybrid" tzinfos following the
recommendations in the Python docs, including the sample
implementations in the docs, correctly mimic local clock behavior
(skipping the clock ahead when DST starts, and moving the clock back
when DST ends) when converting from UTC.  It's impossible now to do
local -> UTC conversions correctly in all cases, because it's
impossible now to know which UTC time was intended for a local time in
a fold.  For the same reason, it's impossible now to know whether a
local time in a fold is intended to be viewed as being in daylight
time or standard time.
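
For reference, the overall shape of such a hybrid tzinfo is roughly
this -- a deliberately toy sketch with a made-up DST rule (the doc
examples compute the real transition instants instead):

from datetime import timedelta, tzinfo

ZERO = timedelta(0)
HOUR = timedelta(hours=1)

class ToyEastern(tzinfo):
    # Toy hybrid zone: base offset UTC-5, plus an hour of "DST" that we
    # pretend runs from April through October.  Real zones need real rules.
    def utcoffset(self, dt):
        return timedelta(hours=-5) + self.dst(dt)

    def dst(self, dt):
        if dt is None:
            return ZERO
        return HOUR if 4 <= dt.month <= 10 else ZERO

    def tzname(self, dt):
        return "EDT" if self.dst(dt) else "EST"

The default .fromutc() inherited from tzinfo then uses .utcoffset() and
.dst() to mimic the local clock across those (toy) transitions.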

But do note limitations of the default .fromutc() implementation:  it
only guarantees correct mimic-the-local-clock behavior when
total-offset transitions are solely due to a notion of "daylight time"
that strictly alternates between .dst() returning zero and non-zero
values.

Transitions due to any other reason may or may not be reflected in
.fromutc()'s treatment of the local clock.  Most importantly, a
transition due to a zone changing its base ("standard") UTC offset is
a possibility the default .fromutc() knows nothing about.

The wrapping of the IANA ("Olson") zoneinfo database in dateutil uses
hybrid tzinfos (the intended way of wrapping zones with multiple UTC
offsets), and inherits the default .fromutc(), so all the above
applies to it.

Including all behaviors stemming from the impossibility of
disambiguating local times in a fold.  That's not a bug in dateutil.
It's a gap in datetime's design.  It was an intentional gap at the
time, but that pytz went to such heroic lengths to fill it suggests
PEP 495 may well be overdue ;-)
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: TopSort in Python?

2008-01-20 Thread Tim Peters
[EMAIL PROTECTED]
> Thanks for the topsort code.  It would be useful in a project I'm
> working on.  Can I use the code for free under public domain?  Thanks!

If I ran the world, everything I posted to a public place would be
public domain.  Alas, the last lawyer who typed at me about this
insisted that an individual in the USA cannot meaningfully disclaim
copyright, so perhaps you're required to pay me millions of dollars
instead ;-)

To keep him happy and you solvent, I hereby license the code under the
MIT license:

http://www.opensource.org/licenses/mit-license.php

Copyright (c) 1999-2008 Tim Peters

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Signed zeros: is this a bug?

2007-03-11 Thread Tim Peters
[attribution lost]
...
>>> Yup: the workaround seems to be as simple as replacing all occurrences
>>> of -0.0 with -(0.0).  I'm embarrassed that I didn't figure this out
>>> sooner.
>>> 
>>> >>> x, y = -(0.0), 0.0
>>> >>> x, y
>>> (-0.0, 0.0)

[Alex Martelli]
>> Glad it works for you, but it's the kind of workaround that could
>> break with any minor tweak/optimization to the compiler...
>> very fragile:-(.

[also Alex]
> OTOH, Python 2.4 works just fine...:
>
> Python 2.4.3 (#1, Apr  7 2006, 10:54:33) 
> [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 0.0,-0.0
> (0.0, -0.0)
> >>> -0.0,0.0
> (-0.0, 0.0)
>
> so it seems to be very specifically a 2.5 problem.

It's a bug that keeps resurfacing, probably because there's no portable 
way to test that it stays fixed :-( (it's not an accident that the OP 
relied on atan2 to distinguish +0.0 from -0.0!  they act the same in 
/almost/ all contexts).
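
For instance, atan2 is one of the few places where the sign does show
through (session from a build without the literal bug; the outputs are
the usual Python 2 reprs):

>>> import math
>>> math.atan2(0.0, -1.0)
3.1415926535897931
>>> math.atan2(-0.0, -1.0)
-3.1415926535897931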

Python's grammar doesn't have negative numeric literals.  Instead

"-" CONSTANT_LITERAL

looks like unary minus applied to the non-negative CONSTANT_LITERAL.  
All early versions of Python worked that way too.

The first time the +/- 0.0 bug occurred is when the front end was 
changed to act as if

"-" CONSTANT_LITERAL

were a literal in its own right, and then that +0.0 == -0.0 caused the 
first instance of either in a compilation unit to be used as the value 
for all instances of both throughout the compilation unit.  That was 
fixed by refusing to apply the optimization if the value of 
CONSTANT_LITERAL was a float that compared equal to 0.0.

2.5 introduced a new front end and more ambitious constant-folding, and 
I expect the bug showed up again due to one of those.

Note:  printing the value of a float 0 may not reveal its sign.  IIRC, 
glibc's float-to-string routines do display the sign of 0, but 
Microsoft's do not.  However, atan2() is sensitive to the sign under 
both glibm and Microsoft's libm equivalent.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: doctest problem with null byte

2007-01-25 Thread Tim Peters
[Stuart D. Gathman]
> I am trying to create a doctest test case for the following:
>
> def quote_value(s):
> """Quote the value for a key-value pair in Received-SPF header
> field if needed.  No quoting needed for a dot-atom value.
>
> >>> quote_value(r'abc\def')
> '"abcdef"'
> >>> quote_value('abc..def')
> '"abc..def"'
> >>> quote_value('')
> '""'
> >>> quote_value('-all\x00')
> '"-allx00"'
> ...
> 
> However, doctest does *not* like the null byte in the example (yes,
> this happens with real life input):
> **
> File "/home/stuart/pyspf/spf.py", line 1453, in spf.quote_value
> Failed example:
> quote_value('-all')
> Exception raised:
> Traceback (most recent call last):
>   File
>   "/var/tmp/python2.4-2.4.4c1-root/usr/lib/python2.4/doctest.py", 
>  line 1248, in __run
> compileflags, 1) in test.globs
> TypeError: compile() expected string without null bytes
> **
> 
> How can I construct a test cast for this?

As the docs say, doctest examples are parsed /twice/:  once when Python 
compiles the module and creates string objects, and again when doctest 
extracts and passes example substrings to the compile() function (which 
you can see in your traceback).

The first time, the \x00 in the

 quote_value('-all\x00')

portion of your docstring got changed into an honest-to-goodness NUL 
byte.  Passing that to compile() can't work.

To alleviate that, and the "leaning toothpick syndrome" in your other 
examples, a simple approach is to make your docstring a raw string 
(stick the letter 'r' in front of the opening triple-quote).

For example, this works fine:

def f():
    r"""
    >>> len("\\")
    1
    >>> ord("\x00")
    0
    """

Because it's a raw string, Python does /not/ change the

\x00

portion into a NUL byte when it compiles the module.  Instead the 
docstring continues to contain the 4-character substring:

\x00

and that's what compile() needs to see.

To perform the same test without using a raw string requires slamming in 
more backslashes (or other ad-hoc escaping tricks):

def g():
    """
    >>> len("\\\\")
    1
    >>> ord("\\x00")
    0
    """
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Maths error

2007-01-13 Thread Tim Peters
[Nick Maclaren]
>> ...
>> Yes, but that wasn't their point.  It was that in (say) iterative
>> algorithms, the error builds up by a factor of the base at every
>> step. If it wasn't for the fact that errors build up, almost all
>> programs could ignore numerical analysis and still get reliable
>> answers! 
>>
>> Actually, my (limited) investigations indicated that such an error
>> build-up was extremely rare - I could achieve it only in VERY
>> artificial programs.  But I did find that the errors built up faster
>> for higher bases, so that a reasonable rule of thumb is that 28
>> digits with a decimal base was comparable to (say) 80 bits with a
>> binary base. 

[Hendrik van Rooyen]
> I would have thought that this sort of thing was a natural consequence 
> of rounding errors - if I round (or worse truncate) a binary, I can be 
> off by at most one, with an expectation of a half of a least 
> significant digit, while if I use hex digits, my expectation is around
> eight, and for decimal around five... 

Which, in all cases, is a half ULP at worst (when rounding -- as 
everyone does now).

> So it would seem natural that errors would propagate 
> faster on big base systems, AOTBE, but this may be 
> a naive view.. 

I don't know of any current support for this view.  In the bad old days, 
such things were often confused by architectures that mixed non-binary 
bases with "creative" rounding rules (like truncation indeed), and it 
could be hard to know where to "pin the blame".

What you will still see stated is variations on Kahan's telegraphic 
"binary is better than any other radix for error analysis (but not very 
much)", listed as one of two techincal advantages for binary fp in:

http://www.cs.berkeley.edu/~wkahan/MktgMath.pdf

It's important to note that he says "error analysis", not "error 
propagation" -- regardless of base in use, rounding is good to <= 1/2 
ULP.  A fuller elementary explanation of this can be found in David 
Goldberg's widely available "What Every Computer Scientist Should Know 
About Floating-Point", in its "Relative Error and Ulps" section.  The 
short course is that rigorous forward error analysis of fp algorithms is 
usually framed in terms of relative error:  given a computed 
approximation x' to the mathematically exact result x, what's the 
largest possible absolute value of the mathematical

   r = (x'-x)/x

(the relative error of x')?  This framework gets used because it's more-
or-less tractable, starting by assuming inputs are exact (or not, in 
which case you start by bounding the inputs' relative errors), then 
successively computing relative errors for each step of the algorithm.  
Goldberg's paper, and Knuth volume 2, contain many introductory examples 
of rigorous analysis using this approach.

Analysis of relative error generally goes along independent of FP base.  
It's at the end, when you want to transform a statement about relative 
error into a statement about error as measured by ULPs (units in the 
last place), where the base comes in strongly.  As Goldberg explains, 
the larger the fp base the sloppier the relative-error-converted-to-ULPs 
bound is -- but this is by a constant factor independent of the 
algorithm being analyzed, hence Kahan's "... better ... but not very 
much".  In more words from Goldberg:

Since epsilon [a measure of relative error] can overestimate the
effect of rounding to the nearest floating-point number by the
wobble factor of B [the FP base, like 2 for binary or 10 for
decimal], error estimates of formulas will be tighter on machines
with a small B.

When only the order of magnitude of rounding error is of interest,
ulps and epsilon may be used interchangeably, since they differ by
at most a factor of B.

So that factor of B is irrelevant to most apps most of the time.  For a 
combination of an fp algorithm + set of inputs near the edge of giving 
gibberish results, of course it can be important.  Someone using 
Python's decimal implementation has an often very effective workaround 
then, short of writing a more robust fp algorithm:  just boost the 
precision.
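
For instance, the (x+y)/2 midpoint surprise from elsewhere in this
thread goes away with even one extra digit (a small sketch; 28 is the
default precision):

>>> import decimal as d
>>> d.getcontext().prec = 28
>>> s = d.Decimal("." + "9" * 28)
>>> (s + s) / 2 == s     # surprising at the default precision
False
>>> d.getcontext().prec = 29
>>> (s + s) / 2 == s     # one extra digit and the midpoint is exact
True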
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Maths error

2007-01-10 Thread Tim Peters
[Tim Peters]
...
>> Huh.  I don't read it that way.  If it said "numbers can be ..." I 
>> might, but reading that way seems to requires effort to overlook the 
>> "decimal" in "decimal numbers can be ...".

[Nick Maclaren]
> I wouldn't expect YOU to read it that way,

Of course I meant "putting myself in others' shoes, I don't ...".

> but I can assure you from experience that many people do.

Sure.  Possibly even most.  Short of writing a long & gentle tutorial, 
can that be improved?  Alas, most people wouldn't read that either <0.5 
wink>.

> What it MEANS is "Numbers with a short representation in decimal

"short" is a red herring here:  Python's Decimal constructor ignores the 
precision setting, retaining all the digits you give.  For example, if 
you pass a string with a million decimal digits, you'll end up with a 
very fat Decimal instance -- no info is lost.

> can be represented exactly in decimal arithmetic", which is
> tautologous.  What they READ it to mean is "One advantage of
> representing numbers in decimal is that they can be represented
> exactly", and they then assume that also applies to pi, sqrt(2),
> 1/3 
>
> The point is that the "decimal" could apply equally well to the
> external or internal representation and, if you aren't fairly
> clued-up in this area, it is easy to choose the wrong one.

Worse, I expect most people have no real idea of that there's a possible 
difference between internal and external representations.  This is often 
given as a selling point for decimal arithmetic:  it's WYSIWYG in ways 
binary fp can't be (short of inventing power-of-2 fp representations for 
I/O, which few people would use).

[attribution lost]
>>>>> and how is decimal no better than binary?

[Tim]
>>>> Basically, they both lose info when rounding does occur.  For
>>>> example, 

[Nick]
>>> Yes, but there are two ways in which binary is superior.  Let's
>>> skip the superior 'smoothness', as being too arcane an issue for
>>> this group,

>> With 28 decimal digits used by default, few apps would care about
>> this anyway.

> Were you in the computer arithmetic area during the "base wars" of the
> 1960s and 1970s that culminated with binary winning out?

Yes, although I came in on the tail end of that and never actually used 
a non-binary machine.

> A lot of very well-respected numerical analysts said that larger bases
> led to a faster build-up of error (independent of the precision).  My
> limited investigations indicated that there was SOME truth in that,
> but it wasn't a major matter; I never saw the matter settled
> definitively.

My point was that 28 decimal digits of precision is far greater than 
supplied even by 64-bit binary floats today (let alone the smaller sizes 
in most-common use back in the 60s and 70s).  "Pollution" of low-order 
bits is far less of a real concern when there are some number of low-
order bits you don't care about at all.

>>> and deal with the other.  In binary, calculating the mid-point
>>> of two numbers (a very common operation) is guaranteed to be within
>>> the range defined by those numbers, or to over/under-flow.
>>>
>>> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range
>>> (x,y) in decimal, even for the most respectable values of x and y.
>>> This was a MAJOR "gotcha" in the days before binary became standard,
>>> and will clearly return with decimal.

>> I view this as being an instance of "lose info when rounding does 
>> occur".  For example,
 
> No, absolutely NOT!

Of course it is.  If there were no rounding errors, the computed result 
would be exactly right -- that's darned near tautological too.  You 
snipped the examples I gave showing exactly where and how rounding error 
created the problems in (x+y)/2 and x/2+y/2 for some specific values of 
x and y using decimal arithmetic.  If you don't like those examples, 
supply your own, and if you get a similarly surprising result you'll 
find rounding error(s) occur(s) in yours too.

It so happens that rounding errors in binary fp can't lead to the same 
counterintuitive /outcome/, essentially because x+x == y+y implies x == 
y in base 2 fp, which is indeed a bit of magic specific to base 2.  The 
fact that there /do/ exist fp x and y such that x != y yet x+x == y+y in 
bases > 2 is entirely due to fp rounding error losing info.
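
Here's a concrete decimal instance of that, at the default 28-digit
precision:

>>> import decimal as d
>>> x = d.Decimal("0.9999999999999999999999999999")
>>> y = d.Decimal("1")
>>> x == y
False
>>> x + x == y + y    # x+x rounds up to exactly 2, so the sums compare equal
True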

> This is an orthogonal matter,

Disagree.

> and is about the loss of an important invariant when using any base
> above 2.

It is that.

> Back in the days when the

Re: Maths error

2007-01-09 Thread Tim Peters
[Tim Peters]
...
>|> Well, just about any technical statement can be misleading if not
>|> qualified to such an extent that the only people who can still
>|> understand it knew it to begin with <0.8 wink>.  The most dubious
>|> statement here to my eyes is the intro's "exactness carries over
>|> into arithmetic".  It takes a world of additional words to explain
>|> exactly what it is about the example given (0.1 + 0.1 + 0.1 - 0.3 =
>|> 0 exactly in decimal fp, but not in binary fp) that does, and does
>|> not, generalize.  Roughly, it does generalize to one important
>|> real-life use-case:  adding and subtracting any number of decimal 
>|> quantities delivers the exact decimal result, /provided/ that
>|> precision is set high enough that no rounding occurs.

[Nick Maclaren]
> Precisely.  There is one other such statement, too: "Decimal numbers
> can be represented exactly."  What it MEANS is that numbers with a
> short representation in decimal can be represented exactly in decimal,
> which is tautologous, but many people READ it to say that numbers that
> they are interested in can be represented exactly in decimal.  Such as
> pi, sqrt(2), 1/3 and so on 

Huh.  I don't read it that way.  If it said "numbers can be ..." I 
might, but reading that way seems to requires effort to overlook the 
"decimal" in "decimal numbers can be ...".

[attribution lost]
>|>> and how is decimal no better than binary?
 
>|> Basically, they both lose info when rounding does occur.  For
>|> example, 

> Yes, but there are two ways in which binary is superior.  Let's skip
> the superior 'smoothness', as being too arcane an issue for this
> group,

With 28 decimal digits used by default, few apps would care about this 
anyway.

> and deal with the other.  In binary, calculating the mid-point
> of two numbers (a very common operation) is guaranteed to be within
> the range defined by those numbers, or to over/under-flow.
>
> Neither (x+y)/2.0 nor (x/2.0+y/2.0) are necessarily within the range
> (x,y) in decimal, even for the most respectable values of x and y.
> This was a MAJOR "gotcha" in the days before binary became standard,
> and will clearly return with decimal.

I view this as being an instance of "lose info when rounding does 
occur".  For example,

>>> import decimal as d
>>> s = d.Decimal("." + "9" * d.getcontext().prec)
>>> s
Decimal("0.9999999999999999999999999999")
>>> (s+s)/2
Decimal("1.000000000000000000000000000")
>>> s/2 + s/2
Decimal("1.000000000000000000000000000")

"The problems" there are due to rounding error:

>>> s/2  # "the problem" in s/2+s/2 is that s/2 rounds up to exactly 1/2
Decimal("0.5000000000000000000000000000")

>>> s+s # "the problem" in (s+s)/2 is that s+s rounds up to exactly 2
Decimal("2.000000000000000000000000000")

It's always something ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Maths error

2007-01-09 Thread Tim Peters
[Rory Campbell-Lange]
>>> Is using the decimal module the best way around this? (I'm
>>> expecting the first sum to match the second). It seem
>>> anachronistic that decimal takes strings as input, though.

[Nick Maclaren]
>> As Dan Bishop says, probably not.  The introduction to the decimal
>> module makes exaggerated claims of accuracy, amounting to propaganda.
>> It is numerically no better than binary, and has some advantages
>> and some disadvantages.

[Carsten Haese]
> Please elaborate. Which exaggerated claims are made,

Well, just about any technical statement can be misleading if not qualified 
to such an extent that the only people who can still understand it knew it 
to begin with <0.8 wink>.  The most dubious statement here to my eyes is 
the intro's "exactness carries over into arithmetic".  It takes a world of 
additional words to explain exactly what it is about the example given (0.1 
+ 0.1 + 0.1 - 0.3 = 0 exactly in decimal fp, but not in binary fp) that 
does, and does not, generalize.  Roughly, it does generalize to one 
important real-life use-case:  adding and subtracting any number of decimal 
quantities delivers the exact decimal result, /provided/ that precision is 
set high enough that no rounding occurs.

> and how is decimal no better than binary?

Basically, they both lose info when rounding does occur.  For example,

>>> import decimal
>>> 1 / decimal.Decimal(3)
Decimal("0.3333333333333333333333333333")
>>> _ * 3
Decimal("0.9999999999999999999999999999")

That is, (1/3)*3 != 1 in decimal.  The reason why is obvious "by eyeball", 
but only because you have a lifetime of experience working in base 10.  A 
bit ironically, the rounding in binary just happens to be such that (1/3)*3 
does equal 1:

>>> 1./3
0.33333333333333331
>>> _ * 3
1.0

It's not just * and /.  The real thing at work in the 0.1 + 0.1 + 0.1 - 0.3 
example is representation error, not sloppy +/-:  0.1 and 0.3 can't be 
/represented/ exactly as binary floats to begin with.  Much the same can 
happen if you instead you use inputs exactly representable in base 2 but 
not in base 10 (and while there are none such if precision is infinite, 
precision isn't infinite):

>>> x = decimal.Decimal(1) / 2**90
>>> print x
8.077935669463160887416100508E-28
>>> print x + x + x - 3*x   # not exactly 0
1E-54

The same in binary f.p. is exact, because 1./2**90 is exactly representable 
in binary fp:

>>> x = 1. / 2**90
>>> print x  # this displays an inexact decimal approx. to 1./2**90
8.07793566946e-028
>>> print x + x + x - 3*x  # but the binary arithmetic is exact
0.0

If you boost decimal's precision high enough, then this specific example is 
also exact using decimal; but with the default precision of 28, 1./2**90 
can't be represented exactly in decimal to begin with; e.g.,

>>> decimal.Decimal(1) / 2**90 * 2**90
Decimal("0.")

All forms of fp are subject to representation and rounding errors.  The 
biggest practical difference here is that the `decimal` module is not 
subject to representation error for "natural" decimal quantities, provided 
precision is set high enough to retain all the input digits.  That's worth 
something to many apps, and is the whole ball of wax for some apps -- but 
leaves a world of possible "surprises" nevertheless.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Q: About a programming problem with Python!

2007-01-08 Thread Tim Peters
[followups set to comp.lang.python]

[Danny]
> I am just getting into OOP using Python and
> because of the floating point of large integer (L)
> square roots all the decimal expansions are truncated.
> This has created a problem because I need these
> decimal expansions in my algorithm other wise
> it adds more sub iterations slowing down
> greatly to the programs performance.
>
> Is there any way of retrieving these truncated decimal
> expansions up to a certain (n) with Python?

You want comp.lang.python for this, not sci.math.  Be sure to give a 
concrete example, giving a specific input and showing exactly as 
possible what you /want/ to achieve.  There are so many possible 
ambiguities in what you wrote here that few people will take time to 
guess at what you might have intended to ask.

> Java and I believe c++ do not have this problem
> but I am not about to learn any of these because 
> I like Python. 

Programming language has little to nothing to do with this, it's the 
data type you're using.  In Java, C++, and Python, if you're using the 
native double-precision floating-point type, you're limited to (on most 
machines) at most 53 bits of precision.

If you need more than that, you need to use a different data type.  For 
example, in recent versions of Python you could use the `decimal` 
module, and set its precision to virtually anything you like; e.g., 
under Python 2.5,

>>> import decimal
>>> t = decimal.Decimal(2)
>>> decimal.getcontext().prec  # show the default precision
28
>>> print t.sqrt()   # sqrt(2) rounded to 28 significant digits
1.414213562373095048801688724
>>> decimal.getcontext().prec = 56  # double the precision
>>> print t.sqrt()   # sqrt(2) to rounded to 56 significant digits
1.4142135623730950488016887242096980785696718753769480732

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: a question on python dict

2007-01-03 Thread Tim Peters
[Tim Peters]
>> ...
>> Taking my response out of context to begin with doesn't really change
>> that I answered the question he asked ;-)

[Fredrik Lundh]
> welcome to comp.lang.python.
> 
>  

Thanks for the welcome!  It's tough to be a newbie here ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: a question on python dict

2006-12-31 Thread Tim Peters
[Tim Peters]
>>>> You should also note that copying a dict key or value (no matter of
>>>> what type) consists in its entirety of copying one machine address (a
>>>> 4- or 8-byte pointer, depending on platform).

[Lawrence D'Oliveiro]
>>> Actually, no. It also consists of updating reference counts as well.

[Tim Peters]
>> Not in context:  dict resizing is refcount-neutral...

[Lawrence D'Oliveiro]
> The question wasn't about resizing, it was about "copying a dict key or
> value".

Then you misunderstood the OP.  He was asking about dict resizing:

...

When there are more and more items being added to the
hashtable, it increase its buckets and copy the old items into the
new one and re-calculate the hash value of each item.

I think this will waste some time doing the increasing-copying
thing.

Taking my response out of context to begin with doesn't really change that 
I answered the question he asked ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: a question on python dict

2006-12-31 Thread Tim Peters
[Tim Peters]
>> You should also note that copying a dict key or value (no matter of
>> what type) consists in its entirety of copying one machine address (a
>> 4- or 8-byte pointer, depending on platform).

[Lawrence D'Oliveiro]
> Actually, no. It also consists of updating reference counts as well.

Not in context:  dict resizing is refcount-neutral, and the CPython 
implementation of dict resizing exploits that (quite directly in 2.5; 
indirectly before 2.5).  It's not just that refcounts end up the same 
/across/ dict resizing, it's that they're not changed internally either for 
the duration of the resizing operation.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: a question on python dict

2006-12-20 Thread Tim Peters
[EMAIL PROTECTED]
> Python dict is a hash table, isn't it?

Yup.

> I know that hashtable has the concept of "bucket size" and "min bucket
> count" stuff,

Some implementations of hash tables do.  Python's does not.  Python's
uses what's called "open addressing" instead.

> and they should be configurable so I can set them to the proper value
> when I know how many items I'm going to handle.
>
> If these two values can't be set, the hashtable will give them default
> values. When there are more and more items being added to the
> hashtable, it increase its buckets and copy the old items into the new
> one and re-calculate the hash value of each item.

That's several distinct claims, each of which is true of some hash
table implementations but not of others.  Python's implementation has
no "buckets", so all that's simply irrelevant.  Python's
implementation (not all do) saves the hash value of each item, so when
it does need to grow (or shrink) the space allocated to the table, it
does not need to recalculate the hash value of any item.

You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).

> I think this will waste some time doing the increasing-copying thing.

It does take time to resize.  "Waste" is a value judgment ;-)

> If I know I'm going to handle about 2 items, I can set the size of
> hashtable to 3.
>
> So, can I do this in python?

You can't.  Unless you build up to 2 items over and over (many
thousands of times), it's unlikely you'd be able to measure the time
difference even if you could.
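
If you want to see for yourself, about the best you can do from pure
Python is to time repeated builds (a rough sketch; the sizes are
arbitrary):

import timeit

t = timeit.Timer("dict(items)",
                 setup="items = [(i, None) for i in range(10000)]")
# Building from scratch includes every intermediate resize; since a dict
# can't be presized, this total is what you'd be hoping to shave.
print min(t.repeat(3, 1000))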
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: merits of Lisp vs Python

2006-12-13 Thread tim . peters
[Bill Atkins]
>> (Why are people from c.l.p calling parentheses "brackets"?)

[Kaz Kylheku]
> Because that's what they are often called outside of the various
> literate fields.

For example, the English are "outside of the various literate fields"?

FWIW, Python documentation consistently uses the jargon:

() parentheses
{} braces
[] brackets

That matches North American conventions, but occasionally confuses an
international audience (for example, the English call parentheses
"brackets" or "round brackets").

There's also a long tradition in both mathematics and computer science
of using "bracket" as a generic term for any syntactic device used in
pairs.  For example, the "Revised Report on the Algorithmic Language
Algol 60" way back in 1963 even called "begin" and "end" brackets.  If
it's tempting to call the authors of that illiterate too, keep in mind
that John McCarthy was one of them -- although I'm sure Peter Naur
would be willing to take the blame for dumbing it down for Europeans ;-)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: not a big deal or anything, but, curiously:

2006-12-12 Thread Tim Peters
[Simon Schuster]
> following this tutorial,

Which tutorial?

> I copied and pasted:
>
> from string import *
>
> cds = """atgagtgaacgtctgagcattagctccgtatatcggcgcacaaa
> tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
> ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
> ccagcgggcgaattgcatattcaccctgtttacccgcatgccgaattgacgagatca
> tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
> attgaaacgccacccgcacgattctggcagctaaagagttaccttcgaccggcggtgatac
> gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
> ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcccgaggag
> gaacatcaacgctggcgcgaggcggtcgcgacccgccgttgctacatccggtggtgcgaac
> gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
> tgagcgagaaagagagcgaagccttgttaagttgtttgcccatatcaccaaaccggagttt
> caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
> tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttataaac
> cgatcgggctaa""".replace("\n","")
>
> gc = float(count(cds, 'g') + count(cds, 'c'))/ len(cds)
>
> print gc
>
> -fin-
>
> which should yield: 0.54460093896713613..
>
> but when I ran it I got: 0.544600938967
>
> looking now I see it's truncating after a certain number of decimal
> places. any ideas why?

Read the Python Tutorial appendix on floating-point issues:

http://docs.python.org/tut/node16.html

As it says, str(a_float) rounds to 12 significant digits, and
repr(a_float) to 17.  The `print` statement implicitly applies str()
to each item it prints.
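
The classic demonstration of the difference (Python 2.x, where these
rounding rules apply):

>>> print 0.1      # print applies str(): 12 significant digits
0.1
>>> 0.1            # the prompt shows repr(): 17 significant digits
0.10000000000000001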
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: merits of Lisp vs Python

2006-12-11 Thread Tim Peters
[Paddy]
>>   http://en.wikipedia.org/wiki/Doctest

[Kaz Kylheku]
> I pity the hoplelessly anti-intellectual douche-bag who inflicted this
> undergraduate misfeature upon the programming language.

As a blind misshapen dwarf, I get far too much pity as it is, but I
appreciate your willingness to share yours.  If you like, feel free to
give this precious gift to another.  I get the impression that pity is
something you don't feel often, and it would be a shame to waste it.

> This must be some unofficial patch that still has a hope of being shot
> down in flames, right?

Alas, no.  It was secretly snuck into Python's standard library in the
2.1 release (2001), and although we've done our best to hide it, it's
widely used now.  Strangely enough, it's especially popular among
intellectuals ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python spam?

2006-11-30 Thread Tim Peters
[Aahz]
>>> Anyone else getting "Python-related" spam?  So far, I've seen messages
>>> "from" Barry Warsaw and Skip Montanaro (although of course header
>>> analysis proves they didn't send it).

[Thomas Heller]
>> I'm getting spam not only from Barry, but also from myself ;-) with
>> forged headers.  But I'm not sure what you mean with Python-related...
>> Not the contents, IIRC.

[Aahz[
> Thing is, I don't usually get spam "from" people I know (or at least it
> gets filtered before I see it), so someone is clearly using some resource
> of Python-related email addresses.  Just seems rather odd, and I wonder
> whether it's some kind of DoS attack or what.

It's been going on for years.  The most frequent forged "Python
related" sender address I've seen is actually [EMAIL PROTECTED],
followed (but not closely) by /F's [EMAIL PROTECTED]  Spammers
harvest legit addresses to forge via scraping web pages and via
Trojans scouring newbies' address books.  The latter probably accounts
for the very high rate of [EMAIL PROTECTED] forgeries.  Via the former,
anyone with a "public" email address can expect to see it get forged
sooner or later.

About two years ago I sent a polite email to a porn vendor asking them
to please stop forging one of my email addresses as the sender of
their spam.  They didn't reply, but within a few days I stopped
receiving porn spam claiming to come from me.  Frankly, I miss it :-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: syntax error in sum(). Please explicate.

2006-11-18 Thread Tim Peters
[Matt Moriarity]
>> try surrounding your sum argument in brackets:
>>
>> sum([phi(x // ps[i+1], i) for i in range(a)])
>>
>> instead of:
>>
>> sum(phi(x // ps[i+1], i) for i in range(a))

[Michael Press]
> Thank you. That makes it work.

But is a wrong solution ;-)  As others have suggested, it's almost
certainly the case that you were using a too-old version of Python.
2.5 is current, and 2.4.4 (the current bugfix release in the 2.4 line)
would also work.  You must have been using a version before 2.4, and
if you're just starting with Python, it's a Bad Idea to artificially
burden yourself with an old version where current examples can't work
;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Decimal() instead of float?

2006-11-17 Thread Tim Peters
[Michael B. Trausch]
>> Let's say that I want to work with the latitude 33.6907570.  In Python,
>> that number can not be stored exactly without the aid of
>> decimal.Decimal().
>>
>> >>> 33.6907570
>> 33.690756999999998
>> >>>
>>
>> As you can see, it loses accuracy after the 6th decimal place.

[Terry Reedy]
> No, it loses accuracy in the 15th decimal place, which is the 17th digit.
> This is far more accurate than any measured latitude could be.
> If the above is a measurement, it represents anything from 33.69075695
> to 33.69075705.  The conversion error would have to be multiplied by some
> number of millions before it would be visible in the 7th decimal place
> (your last).
>
> Any trigonometric calculations will use an *approximate* value of pi and
> polynomial *approximations* of the theoretical functions.

The trend in "modern" math libraries is actually that sin(), cos() and
tan() return a result strictly less than 1 ULP off from the true
mathematical result (including "acting as if" they used an infinitely
precise value for pi).  It remains unclear what actual benefit there
is in, e.g., treating a sloppy argument like 2.31e273 "as if" it were
infinitely precise and reducing it by an infinitely precise value of
pi, /except/ that it allows making precise statements about the
accuracy of functions doing so.  In effect, this way all the blame
gets pinned on the user if they don't like the results ;-)

> Perhaps you have been given impossible conditions to fulfill.

I'm sure it's inconceivable to any programmer that management would
have an unreasonable expectation.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there a commas-in-between idiom?

2006-11-05 Thread Tim Peters
[Ernesto García García]
> it's very common that I have a list and I want to print it with commas
> in between. How do I do this in an easy manner, whithout having the
> annoying comma in the end?
>
> 
>
> list = [1,2,3,4,5,6]
>
> # the easy way
> for element in list:
>print element, ',',
>
> print
>
>
> # this is what I really want. is there some way better?
> if (len(list) > 0):

More idiomatic as

if len(list) > 0:

and even more so as plain

if list:

>print list[0],
>for element in list[1:]:
>  print ',', element,

Do you really want a space before and after each inter-element comma?

> 

An often-overlooked alternative to playing with ",".join() is:

print str(list)[1:-1]

That is, ask Python to change the list into a big string, and just
strip the brackets off each end:

>>> alist = ['a, bc', 4, True]
>>> print str(alist)[1:-1]
'a, bc', 4, True

Note the quotes around the string element!  This differs from what
your code snippet above would produce (in addition to differing wrt
spaces around inter-element commas):

a, bc , 4 , True
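
For completeness, the join idiom alluded to above, spelled out (the
elements have to be converted to strings first):

>>> alist = ['a, bc', 4, True]
>>> print ', '.join(map(str, alist))
a, bc, 4, True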
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ZODB: single database, multiple connections

2006-10-30 Thread Tim Peters
[Petra Chong]
> I am using Python 2.3 and ZODB (without the rest of Zope) with the
> following pattern:
>
> * One process which writes stuff to a ZODB instance (call it test.db)
> * Another process which reads stuff from the above ZODB instance
> test.db
>
> What I find is that when the first process writes, the second doesn't
> pick up the changes. I am sure this must be because I am using ZODB
> wrongly, but I can't find any examples that show how to use ZODB in
> this way, and I can't find any API docs for FileStorage, Connection,
> etc. Reading the source code (from C:\python23\lib\site-packages) has
> not thrown up anything useful.
>
> Here's my test code:
>
> A simple database class:
>
> ...
>
> Write:
>
> ...
>
> Read:
>
> db = Database("test.db", read_only = True)
>
> data = db.get_dictionary('data')
>
> If I have a Python shell open and run the above two lines, if I run the
> write process repeatedly, the above "data" object never contains any of
> the newly added items. To pick them up I have to totally recreate the
> "db" object.

You say that like it's hard to do ;-)

It's a decent way to proceed.  ZODB is a database, and has
transactional semantics:  you never see new object state on the read
side because you're /in/ a transaction, and a transaction guarantees
to give you a consistent view of the data.  The view would be
inconsistent if it showed you state committed by different
transactions on the write side while you're still in the same
transaction on the read side.

> I must be doing something wrongly, but I can't figure out what.

Seems to be a conceptual problem more than anything else.

> Any suggestions?

You already know that tossing your connection and opening a new
connection will give you a newer view of the database, and it's
unclear why you don't think that's good enough.  Other approaches
amount to telling ZODB (on the read side) that you're done with the
current transaction.  For example, try doing

transaction.abort()

on the read side when you're ready to see newer object state.
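
A minimal sketch of the read side, using ZODB directly rather than your
Database wrapper (the key name and the polling loop are illustrative
only):

import time
import transaction
from ZODB import DB, FileStorage

storage = FileStorage.FileStorage("test.db", read_only=True)
conn = DB(storage).open()

while True:
    # End the reader's current transaction so the connection is free to
    # load state committed by the writer since the last look.
    transaction.abort()
    print conn.root().get('data')
    time.sleep(5)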

BTW, a better place to ask about ZODB is the zodb-dev list:

http://mail.zope.org/mailman/listinfo/zodb-dev

It's not just for developers /of/ ZODB.  Note that you need to
subscribe to it in order to post to it (that's a heavyweight anti-spam
gimmick common to all Zope lists).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Dealing with multiple sets

2006-10-25 Thread Tim Peters
[John Henry]
> If I have a bunch of sets:
>
> a = set((1, 2, 3))
> b = set((2, 3))
> c = set((1, 3))
> 
>
> What's the cleanest way to say:
>
> 1) Give me a list of the items that are in all of the sets? (3 in the
> above example)

list(a & b & c)

> 2) Give me a list of the items that are not in all of the sets? (1,2 in
> the above example)

list((a | b | c) - (a & b & c))
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: File read-write mode: problem appending after reading

2006-10-14 Thread Tim Peters
[Frederic Rentsch]
>   Thanks a lot for your input. I seemed to notice that  everything
> works fine without setting the cursor as long as it stops before the end
> of the file. Is that also a coincidence that may not work?

"if you want to read following a write, or write following a read, on
the same stream, you must perform a file-positioning operation
(typically a seek) between them"
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: File read-write mode: problem appending after reading

2006-10-13 Thread Tim Peters
[Frederic Rentsch]
>Working with read and write operations on a file I stumbled on a
> complication when writes fail following a read to the end.
>
>  >>> f = file ('T:/z', 'r+b')
>  >>> f.write ('abcdefg')
>  >>> f.tell ()
> 30L
>  >>> f.seek (0)
>  >>> f.read ()
> 'abcdefg'
>  >>> f.flush ()  # Calling or not makes no difference
>  >>> f.write ('abcdefg')

Nothing is defined about what happens at this point, and this is
inherited from C.  In standard C, if you want to read following a
write, or write following a read, on the same stream, you must perform
a file-positioning operation (typically a seek) between them.

> Traceback (most recent call last):
>   File "", line 1, in -toplevel-
> f.write ('abcdefg')
> IOError: (0, 'Error')

That's one possible result.  Since nothing is defined, any other
outcome is also a possible result ;-)

> Flushing doesn't help.

Right, and because flush() is not a file-positioning operation.

> I found two work arounds:
>
>  >>> f.read ()
> 'abcdefg'
>  >>> f.read ()   # Workaround 1: A second read (returning an empty string)
> ''
>  >>> f.write ('abcdefg')
> (No error)

Purely an accident; may or may not "work" the next time you try it, or
on another platform; etc.

>  >>> f.read ()
> 'abcdefg'
>  >>> f.seek (f.tell ())   # Workaround 2: Setting the cursor (to where it is!)

That's a correct approach.  f.seek(0, 1) (seek 0 bytes from the
current position) is a little easier to spell.

>  >>> f.write ('abcdefg')
> (No error)
>
> I found no problem with writing into the file. So it looks like it has
> to do with the cursor which a read puts past the end, unless it is past
> the end, in which case it goes back to the end. Is there a less kludgy
> alternative to "fseek (ftell ())"?

As above, you need to seek when switching from reading to writing, or
from writing to reading.
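
So the non-kludgy version of the session above is just (sketch):

>>> f = file('T:/z', 'r+b')
>>> f.read()
'abcdefg'
>>> f.seek(0, 1)    # any file-positioning op; "move 0 bytes from here"
>>> f.write('abcdefg')
>>> f.close()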
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hundreds of seconds?

2006-10-11 Thread Tim Peters
[EMAIL PROTECTED]
> ...
> G5-fiwihex:~ eur$ python
> Python 2.3.5 (#1, Mar 20 2005, 20:38:20)
> [GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import time
> >>> time.time()
> 1160580871.258379
> >>>
>
> My G5 has lots of digits behind the decimal point, my measuring PC runs
> W98. We'll see how it does there. But I trust it to be enough digits.

On Windows 98, time.time() typically updates only once per 0.055
seconds (18.2 Hz), but time.clock() typically updates more than a
million times per second.  You do /not/ want to use time.time() for
sub-second time measurement on Windows.  Use time.clock() for this
purpose on Windows.
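
A portable way to pick the right sub-second timer (this mirrors the
choice the timeit module makes on these Python versions):

import sys, time

if sys.platform == "win32":
    timer = time.clock   # very high resolution on Windows
else:
    timer = time.time    # best wall-clock resolution elsewhere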
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python license question

2006-10-08 Thread Tim Peters
[Martitza]
> Mr. Peters:

Na, my father's dead -- you can call me Uncle Timmy ;-)

> Thank you for so kindly taking the time to resolve my misunderstandings
> and to elaborate on the intent of the PSF.
>
> In particular, thank you for explaining in plain language how the
> licenses stack.  I'm sure our counsel will figure out what a license
> from a defunct BeOpen means and anything we do will be in compliance
> with all of the license stack.

I don't know BeOpen.com's legal status (for example, I don't know
whether a bankruptcy judgement was issued).  CWI is a Dutch national
research institute, and CNRI and the PSF are both US 501(c)(3) public
charities -- BeOpen.com was the only for-profit entity in Python's
licensing history.  It's quite intentional that the top three licenses
on the stack "look pretty much alike" -- if I had my way, there would
be only one license, but the parties involved couldn't agree to that
at the time.

While at least the PSF will pursue licence violations, the license is
so permissive that there hasn't yet been any need for that.  To the
best of my knowledge, BeOpen.com, CNRI, and CWI have never had license
complaints against anyone's use of Python either.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python license question

2006-10-08 Thread Tim Peters
[Martitza]
> Hi.  I work for a small company (actually in process of forming)
> interested in embedding or extending python as part of our commercial
> non-open-source product.  We have legal counsel, but are interested in
> the spirit as well as the letter of the law.  Not much seems to have
> been written about the python license since version 2, so pointers to
> more recent discussions or contacts are appreciated.  If this is not
> the right place to ask these questions, I would welcome better ideas.
>
> We've read the license and "layman's language" at
> http://www.python.org/psf/license/ and are need help reconciling the
> two.

There's also the informal license FAQ:

http://wiki.python.org/moin/PythonSoftwareFoundationLicenseFaq

I'm not a lawyer, so can't give legal advice.  I can explain intent,
speaking as a Director of the Python Software Foundation.

> First Observation
> The entire license -- such as produced by license() -- includes a lot
> of information which appears to be historical or anecdotal, not the
> kind of language our lawyers usually see in licensing agreements.
> Lawyers claim they don't like "extra" language.  You be the judge.  :)

Section A (HISTORY OF THE SOFTWARE) isn't really part of the license.
But because the PSF doesn't have the right to /re/license the portions
of Python copyright by CWI, CNRI, or BeOpen.com, the history given
there can be helpful if anyone is inclined to spend (IMO) too much
time worrying about the licenses.

> First Question
> Is all of that information really a required part of the license we
> pass on to our customers or is the "PYTHON SOFTWARE FOUNDATION
> LICENSE VERSION 2" portion sufficient?

Section A is not required.  All of section B is required.  Parts of
Python are copyrighted by, and licensed by, the four entities whose
licenses appear in section B.

> Second Observation
> Referring to sections 2 and 3 of the PSF License Version 2... Our
> non-open-source product will be a derived work, either by extending or
> embedding the python interpreter.  According to section 3, we will
> briefly summarize these modifications to the python interpreter as
> these relate to our product.  Good.

Indeed ;-)  Note that there are no specific requirements about detail,
and this is really for your benefit:  the more clearly you identify
the parts that are your original work, the easier it will be to prove
it is in fact your work (should that become necessary).

The NEWS file that comes with a Python distribution is the PSF's
version of meeting this requirement (by the similar terms in the CNRI
and BeOpen.com licenses, the PSF is also required to summarize the
modifications the PSF makes!).

> Section 2 says that we (as Licensee of the original python) have to
> include without our modified python a copy of the license.  The License
> explicitly states in section 1 that it is between the PSF and the end
> user.

The phrase "end user" appears nowhere in the license.  You must mean
Licensee.  As a Licensee, you're free (heck, encouraged) to make
derivative works, but you have no legal authority to put a /different/
license on any part of the Python you started with.  In exactly the
same way, the PSF has no legal authority to put a different license on
the versions of Python the PSF started with.

When you distribute your derived work, your "end users" have rights to
use the PSF-copyrighted portion of your work by exactly the same means
you have:  they also become Licensees of the PSF.  And of CNRI, and of
BeOpen.com, and of CWI.

The license you put on your derived work as a whole only controls the
portions of your derived work original with you, and your derived work
as a whole.  You can neither diminish nor increase the rights your end
users have wrt any of the Python code you obtained from the PSF --
that's not your work, you don't have and can't get copyright on it,
and you and your end users have exactly the same rights with respect
to it (as spelled out in the license file you're asking about).

> At the moment, we are the end user.  But when we sell our
> software as a derived work, our customer becomes the end user.  So our
> customers are entering into a direct agreement with PSF.  This
> indemnifies the PSF (sections 4,5,6,7) -- also good.
>
> But there is a side effect of section 2 that would seem to give our
> customers many rights ("to reproduce, analyze, test, perform and/or
> display publicly, prepare derivative works, distribute...") to the
> derived work we created.

I think the licenses are clear about this.  For example, the PSF
license at the top of the stack specifically says it only applies to
"Python 2.4 alone or in any derivative version".  The phrases "alone"
and "in any derivative version" intend to modify "Python 2.4".  It's
not intended to be read as the two separate phrases "Python 2.4 alone"
or "in any derivative version", and I'm not sure it even makes sense
to try to read it that way.

> We would have a problem with our cus

Re: Python to use a non open source bug tracker?

2006-10-07 Thread Tim Peters
[Giovanni Bajo]
> I understand your concerns, but I have to remember you that most bug reports
> submitted by users go totally ignored for several years, or, better, forever. 
> I
> do not have a correct statistic for this,

Indeed you do not.

> but I'm confident that at least 80% of the RFE or patches filed every week
> is totally ignored, and probably at least 50% of the bugs too.

None are /totally ignored/ -- indeed, at least I see every one as it
comes in.  You might want to change your claim to that no work
obviously visible to you is done on them.  That would be better.  But,
in fact, most bugs and patches are eventually closed, and many that
stay open involve such obscure platform-dependent mysteries that
nobody with sufficient platform expertise to resolve them appears to
exist.  For example, if you'd prefer, I'll assign all bugs and patches
involving threads on HP-UX to you from now on ;-)

These are the actual stats as of a few minutes ago:

Bugs:  938 open of 7169 total ~= 87% closed
Patches:  429 open of 3846 total ~= 89% closed
Feature Requests:  240 open of 479 total ~= 50% closed

> I think there is a much bigger problem here wrt QOS.

Well, you're confident that 80% of patches are ignored.  In reality,
89% of all patches ever submitted have been pursued to final
resolution.  Call me a stickler for detail, but something just doesn't
jibe there to my eyes ;-)

There's an easy way to improve these percentages dramatically,
although they're not bad as-is:  run thru them and close every one
that isn't entirely clear.  For example, reject every feature request,
close every patch that changes visible behavior not /clearly/ fixing a
bona fide bug, and close every "bug report" that's really a feature
request or random "but Perl/Ruby/PHP doesn't do it this way" complaint
in disguise.

The Python developers tend to keep a report open if there's a scant
non-zero chance that somebody, someday, might appear who's motivated
enough to make something of it.  If the goal was instead to make the
percentages "look good", they could easily and justifiably be
dramatically "improved" before today ends.

For example, the oldest patch open today is a speculative
implementation of rational numbers for Python.  This is really a
feature request in disguise, and has very little chance-- but not /no/
chance --of ever being accepted.  The oldest bug open today is from 6
years ago, and looks like an easy-to-answer /question/ about the
semantics of regular expressions in Python 1.6.  I could take time to
close that one now, but is that a /good/  use of time?  Yes, but, at
the moment, even finishing this reply seems to be a /better/ use of my
time -- and after that, I'm going to get something to eat ;-)

Note that I don't mean to claim that turnaround time on bugs and
patches is ideal.  To the contrary, if it's /my/ bug or patch I'm
looking at it, turnaround time sucks, and if you're looking at yours,
likewise for you.  That's what happens when there are thousands of
"you"s and a handful of "them"s, all of the latter volunteering "spare
time".

OTOH, turnaround time on Python bugs classified as critical is superb.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Question about turning off garbage collection

2006-10-05 Thread Tim Peters
[David Hirschfield]
> Question from a post to pygtk list...but it probably would be better
> answered here:
>
> I encountered a nasty problem with an external module conflicting with
> my python threads recently, and right now the only fix appears to be to
> turn off garbage collection while the critical code of the thread is
> running, and then turn it back on afterwards.
>
> Now, I don't know much about how the garbage collector works in python,
> and in order to get the thread to run without freezing, I'm wrapping the
> threaded processing function with calls to gc.disable()/gc.enable().
>
> So what's that going to do?

Stop the /cyclic/ garbage-collection algorithm from running for as
long as it remains disabled.  Most garbage collection in Python, in
typical applications most of the time, is done by reference counting,
and /that/ can never be disabled.  Reference counting alone is not
strong enough to detect trash objects in cycles (like A points to B
and B points to A and nothing else points to either; they're
unreachable trash then, but the reference count on each remains 1).
The `gc` module controls cyclic garbage collection, which is a
distinct subsystem used to find and reclaim the trash cycles reference
counting can't find on its own.

> Will calling gc.enable() put things in good shape? Will all objects created 
> while
> the garbage collector was off now be un-collectable?

No.  It has no effect except to /suspend/ running the cyclic gc
subsystem for the duration.  Trash related to cycles will pile up for
the duration.  That's all.
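
A small sketch of the distinction (Python 2 syntax to match the above;
the class is arbitrary):

import gc

class Leaf(object):
    def __del__(self):
        print "reclaimed by reference counting"

gc.disable()
x = Leaf()
del x            # refcount hits zero: reclaimed at once, gc disabled or not

a, b = [], []
a.append(b)
b.append(a)
del a, b         # a trash cycle: it just piles up while cyclic gc is off

gc.enable()
print gc.collect()   # now cyclic gc finds and reclaims the cycle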

> I'm extremely wary of this solution, as I think anyone would be. I don't want 
> a
> suddenly super-leaky app.

As above, most garbage should continue to get collected regardless.

> Comments? Suggestions? (I know, I know, avoid threads...if only I could)

Nothing wrong with threads.  My only suggestion is to dig deeper into
/why/ something goes wrong when cyclic gc is enabled.  That smells of
a serious bug, so that disabling cyclic gc is just papering over a
symptom of a problem that will come back to bite you later.  For
example, if some piece of code in an extension module isn't
incrementing reference counts when it should, that could /fool/ cyclic
gc into believing an object is trash /while/ the extension module
believes it has a valid pointer to it.  If so, that would be a serious
bug in the extension module.  Enabling cyclic gc should not create
problems for any correctly written C code.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: weakSet

2006-10-05 Thread Tim Peters
[EMAIL PROTECTED]
> Has anyone ever think about a set wich references its elements weakly ?

Yes, and there are excruciating subtleties.  I only implemented as
much of one as ZODB needed at the time:

# A simple implementation of weak sets, supplying just enough of Python's
# sets.Set interface for our needs.

class WeakSet(object):
"""A set of objects that doesn't keep its elements alive.

The objects in the set must be weakly referencable.
The objects need not be hashable, and need not support comparison.
Two objects are considered to be the same iff their id()s are equal.

When the only references to an object are weak references (including
those from WeakSets), the object can be garbage-collected, and
will vanish from any WeakSets it may be a member of at that time.
"""

For the rest ;-), including discussion of excruciating subtleties that
arise even implementing this little, see:

http://svn.zope.org/ZODB/trunk/src/ZODB/utils.py
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python to use a non open source bug tracker?

2006-10-04 Thread Tim Peters
[Ben Finney]
>> I don't see why you're being so obtuse

[Terry Reedy]
> I think name calling is out of line here.

Name calling is always out of line on comp.lang.python.  Unless it's
done by Guido.  Then it's OK.  Anyone else, just remind them that even
Hitler had better manners.  That always calms things down again.

loving-usenet-despite-that-it's-usenet-ly y'rs  - tim
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[Charlie Strauss]
>>> level0:  newly created objects
>>> level1:  objects that survived 1 round of garbage collection
>>> level2:  objects that survived 2+ rounds of garbage collection
>>>
>>> Since all of my numerous objects are level2 objects, and none of
>>> them are ever deallocated, then I should never trip the GC for
>>> these.

[Fredrik Lundh]
>> your understanding of generational GC is a bit fuzzy, it seems.
>> that an object is promoted to a higher level means that it's
>> not inspected quite as often as lower-level objects, not that it's
>> never inspected at all.

[Charlie]
> As I understand it, level2 (and level1) objects only undergo gc when
> more than 10 of them are deallocated.  Since I never deallocate, this
> should not be tripped, right?

No.  Cyclic gc is triggered by an excess of allocations over deallocations.
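
In 2.5 you can watch the trigger condition directly (a small illustration;
the exact numbers vary):

import gc

print gc.get_threshold()  # (700, 10, 10) by default
print gc.get_count()      # roughly, allocations minus deallocations per generation
junk = [[] for dummy in xrange(500)]   # a net excess of container allocations
print gc.get_count()      # the generation-0 count has climbed; passing 700 triggers a run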

> In any case regarding your other comments:

>>> Could you clarify that for me.  GC really has three components
>>> to it:  1) finding and freeing unreferenced memory by reference
>>> counts 2)  cycle removal and 3)  defragmenting the
>>> storage stack.  If I turn off GC, don't I lose all of these?

>> CPython always does (1), only does (2) if cycle-breaking GC isn't
>> disabled, and never does (3).

> Never does 3?

Correct.

>  then how does it compact its memory allocation area?

It doesn't.

> Surely it does not rely on the OS to manage every small object as a
> separate memory allocation.

It doesn't do that either.  Python has its own small-object allocator,
carving up 256KB chunks obtained from the system malloc().  It's based
on size-segregated "pools" with extremely low bookkeeping overhead,
and external fragmentation in small-object memory is essentially
non-existent because of that (although it's possible to concoct
pathological programs that exhibit it).

> And just to be clear: are you saying that when I do a gc.disable this
> only turns off 2 and not 1?

Right.  Refcounting (#1) can never be disabled, and cyclic GC (#2) is
used only for trash objects that can't be reclaimed by #1 (meaning
container objects in cycles).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[charlie strauss]
> Steve, digging into the gc docs a bit more, I think the behaviour I am
> seeing is still not expected.  Namely, the program I offered has no
> obvious place where objects are deallocated.  The way GC is supposed to
> work is that there are three levels of objects
>
> level0:  newly created objects
> level1:  objects that survived 1 round of garbage collection
> level2:  objects that survived 2+ rounds of garbage collection

Yes.

> Since all of my numerous objects are level2 objects,

No.  All newly created objects start in level0.  Over time, most (but
never all) of your objects end up in level 2.

> and none of them are ever deallocated, then I should never trip the GC
> for these.

Cyclic gc scans /all/ container objects at or under level N, whenever
cyclic gc runs.  N varies from run to run of cyclic gc, according to
the scheme described in the docs for the gc.set_threshold() function.
N=2 is certainly a possible value.  There is never a time when a live
container object becomes exempt from all future runs of cyclic gc.  If
a live container object survives two rounds of cyclic gc, it ends up
in level2.  That doesn't mean it's never looked at again, it just
means it's not looked at during level0 or level1 (N=0 or N=1) runs of
cyclic gc.  It will still be looked at during all level2 (N=2) runs of
cyclic gc.

> Your explanation would require this to be tripped so I can't explain it.
> For your explanation to be correct then there has to be some non-obvious
> step in the program that is deallocating level2 items in sufficient
> numbers to trip the GC.

Deallocations never trigger cyclic gc.  As the docs say, cyclic gc is
triggered by an /excess/ of allocations /over/ deallocations.  So,
e.g., if you delete container objects just as fast as you create them,
cyclic gc will never run.  But that's not what you're doing.  Instead
you're allocating new objects but never deallocating them.  That makes
cyclic gc run as frequently as it's possible for it to run.
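
If the pauses hurt, the practical knobs follow directly from that (a sketch;
the defaults shown are CPython's):

import gc

print gc.get_threshold()          # (700, 10, 10) by default
gc.set_threshold(100000, 10, 10)  # collect far less often during the build-up phase
# ... build the big, cycle-free structure here ...
gc.set_threshold(700, 10, 10)     # or wrap the phase in gc.disable()/gc.enable() instead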
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[charlie strauss]
> I want to clarify that, on my computer, the first instance of the gap
> occurs way before the memory is filled (at about 20% of physical ram).
> Additionally the process monitor shows no page faults.

Python has no idea of how much RAM you have, or even of how much RAM
it's using.  See the `gc` module docs, function set_threshold(), for a
description of when Python decides to run a cyclic gc pass.  That's
all about the current excess (if any) of the number of container
allocations over the number of container deallocations since the last
time cyclic gc ran.

> ...
> (note that the array is filled as [1]*10, so there is actually only one
> "integer", but 10 array elements referring to it, per foo class.)

Nevertheless cyclic gc has to examine all 10 array elements each time
the list is scanned.

> ...
> For example when I write
>
> me.memory = [1]*nfoo
>
> perhaps internally, python is allocating an array of size foo and then
> __copying__ it into me.memory???

No.  Assignments in Python never copy anything.

> Since there is no reference to the intermediate it would then be marked
> for future garbage collection.

Nothing in Python is "marked for future garbage collection".  From
time to time /all/ container objects are scanned to determine whether
any have become cyclic trash.  This takes time proportional to the
total number of containees across all container objects.

...

> foo was an election ballot holding 10 outcomes, and bar was a set of 100
> ballots from 100 voting machines, and the total array was holding the
> ballot sets from a few thousand voting machines.
>
> Almost any inventory program is likely to have such a similar set of
> nested arrays, so it hardly seems unusual.

For memory-consumption reasons alone, a serious such program is
likely to use array.array objects at leaf level instead.  array.array
objects hold raw bits, not Python objects.  Since they don't hold
Python objects, they can't possibly be part of a cycle, so cyclic gc
doesn't need to scan array.array contents either.
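
A rough way to see the difference (counts are approximate and vary by
release):

import array, gc

before = len(gc.get_objects())
as_lists = [[1] * 10 for dummy in xrange(1000)]
middle = len(gc.get_objects())
as_arrays = [array.array('l', [1] * 10) for dummy in xrange(1000)]
after = len(gc.get_objects())

# Each inner list is itself a tracked container with 10 containees to scan;
# each array.array holds raw C longs, so cyclic gc never looks inside it.
print middle - before, after - middle   # roughly 1001 versus 1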
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[Steve Holden, "pins the blame" for pauses on periodic cyclic gc]
> ...
> So basically what you have here is a pathological example of why it's
> sometimes wise to disable garbage collection. Tim, did I miss anything?

Nope!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[charlie strauss]
>>> Below is a simple program that will cause python to intermittently
>>> stop executing for a few seconds.  it's 100% reproducible on my
>>> machine.

[Giovanni Bajo]
>> Confirmed with Python 2.4.2 on Windows.

[Jorgen Grahn]
> And Python 2.3.5 on Linux, amd64.  In fact, it causes heavy swapping so it
> will disrupt other processes, too.
>
> I didn't read the code, stupid as I am, but I trust that the behavior
> doesn't match what the code actually tries to do.

No, that's what it does:  as it goes on, it creates an ever-increasing
and unbounded number of immortal objects and lists.  The "time burps"
merely reflect that the more live containers and objects you have, the
longer it takes for cyclic gc to determine that they're not trash.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? Python intermittently stops dead for seconds

2006-10-01 Thread Tim Peters
[charlie strauss]
> Below is a simple program that will cause python to intermittently
> stop executing for a few seconds.  it's 100% reproducible on my machine.

Any program that creates a great many long-lived container objects
will behave similarly during the creation phase.  Others have
explained why.  Here's a simpler example:

from time import time
xs = []
i = 0
while 1:
    i += 1
    s = time()
    xs.append([[1, 2] for dummy in xrange(1000)])
    f = time()
    if f-s > 0.25:
        print "time", f-s, "on try", i
    if i % 100 == 0:
        print "xs made:", i
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Installing 2.5 with 2.4?

2006-09-20 Thread Tim Peters
[Duncan Booth]
>> Windows is only smart enough to avoid duplicate entries if you tell it
>> to do that. e.g.
>>
>> PATH c:\python25;c:\python25\scripts;%PATH:c:\python25;c:\python25\scripts;=%
>>
>> will add the two Python 2.5 folders to the head of the path without
>> duplicating them.

[John Machin]
> Wow .. I didn't know that! What's the syntax? Something like
> %variablename[:oldtext=[newtext]]%
> ?

Yup.

> Where is this documented?

From a DOS box (cmd.exe), enter

set /?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hardlinks on NTFS

2006-09-17 Thread Tim Peters
[Wildemar Wildenburger]
>> I'm thinking of letting my program create hardlinks (or symlinks). I
>> know python allows doing this for ext, reiser and the like, but
>> apparently not for ntfs systems.
>> Is there any package out there that lets me create links in a platform
>> independent way?

[Calvin Spealman]
> Why isn't NTFS handled the same as any other hardlink supporting
> filesystem? Is it a choice or a bug?

None of the above.  It's not the filesystem, it's what the platform C
supports.  POSIX specifies a small pile of standard routines for
manipulating hard and symbolic links, and Python exposes those /when/
they're available.  Microsoft's C supplies none of them.

So it's a question of someone on Windows wanting this enough to
volunteer to write code, tests, and docs, and volunteer to maintain, a
pile of Microsoft-specific code to provide workalike functionality on
Windows.
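
In the meantime, one Windows-only workaround is to call the Win32 API
yourself via `ctypes` (a sketch for hard links only; needs ctypes and an
NTFS volume):

import ctypes

def hardlink(source, link_name):
    # CreateHardLinkW(new_link_name, existing_file_name, security_attrs)
    ok = ctypes.windll.kernel32.CreateHardLinkW(
        unicode(link_name), unicode(source), None)
    if not ok:
        raise OSError("CreateHardLink failed, error %d" % ctypes.GetLastError())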
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-16 Thread Tim Peters
[Tim Peters]
>> [...]  The most valuable general technique [Friedl] (eventually ;-)
>> explained he called "unrolling", and consists of writing a regexp in
>> the form:
>>
>>normal* (?: special normal* )*
>>
>> where the sets of characters with which `normal` and `special` can
>> start are disjoint.  Under most "NFA" implementation, such a regexp is
>> immune to most bad behaviors, whether finding very long matches, or in
>> searching a very long string that happens not to contain any matches.

[Bryan Olson]
> So how general is that?

Between 6 and 7 ;-)

> The books just says "certain common expressions". I think the real
> answer is that one can unroll exactly the true regular expressions.

I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).

> The unrolling is essentially resolving the states of a DFA by hand.

Oh yes, for those who've studied the theoretical underpinnings.  Most
people have not.  Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works.  And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the  search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.

That explanation may not work for you, but it's effective for people
who haven't studied the theory here.  Such explanations also extend
naturally to gimmicks like backreferences,  which have no counterpart
in CompSci regexps.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: min() & max() vs sorted()

2006-09-15 Thread Tim Peters
[MRAB]
>>> Some time after reading about Python 2.5 and how the built-in functions
>>> 'min' and 'max' will be getting a new 'key' argument, I wondered how
>>> they would treat those cases where the keys were the same, for example:
>>>
>>> L = ["four", "five"]
>>> print min(L, key = len), max(L, key = len)
>>>
>>> The result is:
>>>
>>> ('four', 'four')

[Tim Peters]
>> min() and max() both work left-to-right, and return the minimal or
>> maximal element at the smallest index.

[MRAB]
> It doesn't say that in the documentation.

Right, the /language/ doesn't define anything about which specific
minimal or maximal element is returned.  Since you specifically
mentioned Python 2.5, I figured you were asking about CPython -- which
has always behaved in the way I explained.

>>> I would've thought that min(...) should return the same as
>>> sorted(...)[0] (which it does)

>> It does, but only because Python's sort is stable, so that minimal
>> elements retain their original relative order.  That implies that the
>> minimal element with smallest original index will end up at index 0
>> after sorting.

>>> and that max(...) should return the same as sorted(...)[-1] (which
>>> it doesn't).

>> Right -- although I don't know why you'd expect that.

> Strings have index(), find(), etc which work left-to-right and
> rindex(), rfind(), etc which work right-to-left.
>
> Lists have index() but not rindex().
>
> I just thought that if min() and max() work left-to-right then for
> completeness there should also be rmin() and rmax(); alternatively,
> min() should return sorted()[0] and max() should return sorted()[-1]
> for symmetry (my personal preference).

If you were to make either of those a feature request, I don't expect
they'd gain traction -- I expect "who cares?" would be the common
challenge, and "I do" wouldn't silence it ;-)  Compelling use cases
sometimes work to get a new feature, but "completeness" or "symmetry"
almost never do on their own (they function more as sanity checks on
proposed solutions to use cases).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how do you convert and array of doubles into floats?

2006-09-15 Thread Tim Peters
[Marc 'BlackJack' Rintsch]
>> What about:
>>
>> b = array.array('f', a)

[Diez B. Roggisch]
> AFAIK d and f are synonym for arrays, as python doesn't distinguish
> between these two on a type-level. And double it is in the end.

While Python has no type of its own corresponding to the native C
`float`, the `array` and `struct` modules do understand the native C
`float` .  A Python float (same as a C `double`) gets converted to a C
`float` when stored into one of those, and a C `float` is converted to
a Python float (C `double`) when a value is extracted.

>>> from array import array
>>> x = 1.01
>>> x
1.01
>>> array('d', [x])[0]   # full precision is preserved
1.01
>>> array('d', [x])[0] == x
True
>>> array('f', [x])[0]   # trailing bits are lost
1.0
>>> array('f', [x])[0] == x
False
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: min() & max() vs sorted()

2006-09-14 Thread Tim Peters
[MRAB]
> Some time after reading about Python 2.5 and how the built-in functions
> 'min' and 'max' will be getting a new 'key' argument, I wondered how
> they would treat those cases where the keys were the same, for example:
>
> L = ["four", "five"]
> print min(L, key = len), max(L, key = len)
>
> The result is:
>
> ('four', 'four')

min() and max() both work left-to-right, and return the minimal or
maximal element at the smallest index.

> I would've thought that min(...) should return the same as
> sorted(...)[0] (which it does)

It does, but only because Python's sort is stable, so that minimal
elements retain their original relative order.  That implies that the
minimal element with smallest original index will end up at index 0
after sorting.

> and that max(...) should return the same as sorted(...)[-1] (which it
> doesn't).

Right -- although I don't know why you'd expect that.
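
Spelled out with the original example:

>>> L = ["four", "five"]
>>> min(L, key=len), sorted(L, key=len)[0]
('four', 'four')
>>> max(L, key=len), sorted(L, key=len)[-1]
('four', 'five')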

> I think that's just down to a subtlety in the way that 'max' is written.

It's straightforward, skipping error cases:

def max(*args):
    is_first_arg = True
    for arg in args:
        if is_first_arg:
            winner = arg
            is_first_arg = False
        elif arg > winner:  # only difference in min() is "<" here instead
            winner = arg
    return winner
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Bryan Olson]
>>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
>>> book

[Tim Peters]
>> Strongly disagree: [...]

[Bryan]
> I know I'm disagreeing with a lot of smart people in panning
> the book.

That's allowed :-)

>>> What Python uses is search-and-backtrack. Unfortunately such
>>> engines don't have much theory behind them, and it's hard to
>>> reason generally about what they do.

>> Yup X 3, and the last is precisely why Friedl's book is valuable for
>> people using "NFA" implementations:  Friedl does a good job of
>> explaining when and why you're likely to trigger atrocious runtime
>> performance, and gives practical general techniques for avoiding those
>> problems.  None of that has anything to do with CompSci regexps
>> either, but is vital knowledge for people hoping to make happy
>> non-trivial use of Python/Perl/etc regexps.

> I read the first edition, minus some engine-specific stuff.
> General techniques are what I want, and I didn't see them in the
> book. He had things to do and not do, but it didn't add up to
> building (ir-)regular expressions with reliably tractable
> behavior.

My guess then is that the book moved at too slow a pace for you, and
you got bored.  The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint.  Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal:  double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character:  not a
double quote, backslash, or newline.  The "special case" is a
backslash followed by any character whatsoever.  The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end)  fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence.  For
/why/ that's so, read the rest of his book ;-)  In effect, it gives
the search engine only one possible course of action at each character
in the target string.  So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part.  The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).
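
If you want to play with it, the whitespace-laden form above compiles
directly under re.VERBOSE:

import re

c_string = re.compile(r'''
    " [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "
''', re.VERBOSE)

print c_string.match(r'"a \"quoted\" string with a \\ backslash"').group()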

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense".  It's a set of obscure & specialized learned skills.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Licheng Fang]
>> Basically, the problem is this:
>>
>> >>> p = re.compile("do|dolittle")
>> >>> p.match("dolittle").group()
>> 'do'

...

>> The Python regular expression engine doesn't exaust all the
>> possibilities, but in my application I hope to get the longest possible
>> match, starting from a given point.
>>
>> Is there a way to do this in Python?

[Bryan Olson]
> Yes. Here's a way, but it sucks real bad:
>
>
>  def longest_match(re_string, text):
>      regexp = re.compile('(?:' + re_string + ')$')
>      while text:
>          m = regexp.match(text)
>          if m:
>              return m
>          text = text[:-1]
>      return None

If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.
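
For instance (a sketch of that change; only two integers vary per attempt
now, instead of building a new string each time):

import re

def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    # Shrink the endpos argument instead of copying ever-shorter strings.
    for end in xrange(len(text), 0, -1):
        m = regexp.match(text, 0, end)
        if m:
            return m
    return None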
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-12 Thread Tim Peters
[Licheng Fang]
>> Oh, please do have a look at the second link I've posted. There's a
>> table comparing the regexp engines. The engines you've tested probably
>> all use an NFA implementation.

[Bryan Olson]
> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book

Strongly disagree:  it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented.  It's not at
all about "regular expressions" in the  CompSci sense of the term,
which appears to be your complaint.

> was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.

As far as I could tell at the time, he originated it.  I'm not happy
about that either.

> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.

And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.

> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations:  Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems.  None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-11 Thread Tim Peters
[Licheng Fang]
> ...
> In fact, what I'm doing is handle a lot of regular expressions. I
> wanted to build VERY LONG regexps part by part and put them all into a
> file for easy modification and maintenance. The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the sytactically wrong (* and ) is something to be
> replaced, and then I wrote a little script to do the string
> substitution, to get very long regexps to be compiled. I thought that
> might be a way to handle long and complex regexps, and in this course I
> encountered the problem with the semantics of '|'.
>
> My approach may sound desperate and funny, but please, do you have any
> good idea as to how to handle long and complex regexps?

My best advice is to find a different approach entirely.  For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.
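
For instance, a sketch of that division of labor for the grammar above --
the regexp only chops out gross tokens, and deciding that INT "." INT is a
decimal, or decimal "/" decimal a fraction, is left to the parser layer:

import re

token_re = re.compile(r'\s*(?:(\d+)|(\.)|(/)|(\S))')

def tokenize(text):
    for match in token_re.finditer(text):
        digits, dot, slash, other = match.groups()
        if digits:
            yield 'INT', digits
        elif dot:
            yield 'DOT', dot
        elif slash:
            yield 'SLASH', slash
        else:
            yield 'OTHER', other

print list(tokenize("3.14/2"))
# [('INT', '3'), ('DOT', '.'), ('INT', '14'), ('SLASH', '/'), ('INT', '2')]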

Regexps are a brittle tool, best tolerated in small doses.  For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with.  OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).

Just sounds to me like the wrong tool for the job.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get the "longest possible" match with Python's RE module?

2006-09-11 Thread Tim Peters
[Licheng Fang]
> ...
> I want to know if there is some way to make Python RE behave like grep
> does,

Not in general, no.  The matching strategies couldn't be more
different, and that's both deep and intentional.  See Friedl's book
for details:

http://regex.info/

> or do I have to change to another engine?

Yes, if POSIX regexp semantics are what you require.  Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching.  You may be more motivated to search longer ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: GC and security

2006-08-30 Thread Tim Peters
[Aahz]
>> Assuming you're talking about CPython, strings don't really participate
>> in garbage collection.  Keep in mind that the primary mechanism for
>> reaping memory is reference counting, and generally as soon as the
>> refcount for an object goes to zero, it gets deleted from memory.

[Les Schaffer]
> ok so far ...

Not really ;-)  Refcounting is /a/ "garbage collection" technique, and
drawing a distinction between refcount-based reclamation and other
ways CPython reclaims memory has no bearing on your question.

>> Garbage collection only gets used for objects that refer to other
>> objects, so it would only apply if string refcounts are being held by
>> GC-able objects.

> you lost me by here ...

That tends to happen when an irrelevant distinction gets made ;-)

> is there something different about string objects than other objects in
> Python?

Not anything relevant wrt garbage collection.  It may be relevant that
strings are immutable, since that prevents you from overwriting a
string's contents yourself.

> are you saying EVERY string in Python stays in memory for the lifetime
> of the app?

He wasn't, and they don't.

>> Also keep in mind, of course, that deleting objects has nothing to do
>> with whether the memory gets overwritten...

> no chance of being overwritten until deleted, no?

True.

> and once deleted over time there is some probability of being
> overwritten, no?

True.  Also true if you add the intended "non-zero" qualifier to
"probability" ;-)

> and i am curious how that works.

Purely accidental -- nothing guaranteed -- details can (& do) change
across releases.  Read obmalloc.c for a tour of the accidents du jour.

> it sounds like you are saying once a string, always the same string, in
> python.  if thats true, i am glad i know that.
> if thats true, i am glad i know that.

Not true, so be glad to forget it.

A curious possibility:  if you do a debug build of Python, obmalloc.c
arranges to overwrite all of an object's memory as soon as the object
is reclaimed (by any means, refcounting or otherwise).  That wasn't
for "security" (faux or otherwise), it's to make it easier to detect
buggy C code referencing freed memory.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
[/T]
>> OTOH, current versions of Python (and Perl)

[/F]
> just curious, but all this use of (& Perl) mean that the Perl folks have
> implemented timsort ?

A remarkable case of independent harmonic convergence:

http://mail.python.org/pipermail/python-dev/2002-July/026946.html

Come to think of it, I don't actually know whether a /released/ Perl
ever contained the development code I saw.  Looks like it got added to
Perl 5.8:

http://perldoc.perl.org/sort.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
[Joachim Durchholz]
>>>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively
>>>>> dominated by the comparison, that could give a speedup of up to 100%
>>>>> (approximately - dropping the logN factor is almost irrelevant, what
>>>>> counts is losing that factor of 2).

[Gabriel Genellina]
>>>> In fact it's the other way - losing a factor of 2 is irrelevant,
>>>> O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]
>>> That's just a question of what you're interested in.
>>>
>>> If it's asymptotic behavior, then the O(logN) factor is a difference.
>>>
>>> If it's practical speed, a constant factor of 2 is far more relevant
>>> than any O(logN) factor.

[Tim Peters]
>> Nope.  Even if you're thinking of base 10 logarithms, log(N, 10) > 2
>> for every N > 100.  Base 2 logarithms are actually most appropriate
>> here, and log(N, 2) > 2 for every N > 4.  So even if the "2" made
>> sense here (it doesn't -- see next paragraph), the log(N) term
>> dominates it for even relatively tiny values of N.

[Joachim Durchholz]
> Whether this argument is relevant depends on the constant factors
> associated with each term.

I'm afraid you're still missing the point in this example:  it's not
just that Python's (& Perl's) current sorts do O(N*log(N)) worst-case
comparisons, it's that they /do/ N*log(N, 2) worst-case comparisons.
O() notation isn't being used, and there is no "constant factor" here:
the count of worst-case comparisons made is close to exactly N*log(N,
2), not to some mystery-constant times N*log(N, 2).  For example,
sorting a randomly permuted array with a million distinct elements
will require nearly 1000000*log(1000000, 2) ~= 1000000 * 20 = 20
million comparisons, and DSU will save about 19 million key
computations in this case.  O() arguments are irrelevant to this, and
the Wikipedia page you started from wasn't making an O() argument
either:

http://en.wikipedia.org/wiki/Schwartzian_transform

For an efficient ordinary sort function, the number of invocations of the
transform function goes from an average of 2nlogn to n;

No O() in sight, and no O() was intended there either.  You do exactly
N key computations when using DSU, while the hypothetical "efficient
ordinary sort function" the author had in mind does about 2*N*log(N,
2) key computations when not using DSU.  That's overly pessimistic for
Python's & Perl's current sort functions, where no more than N*log(N,
2) key computations are done when not using DSU.  The /factor/ of key
computations saved is thus as large as N*log(N, 2) / N = log(N, 2).
O() behavior has nothing to do with this result, and the factor of
log(N, 2) is as real as it gets.  If key extraction is at all
expensive, and N isn't trivially small, saving a factor of log(N, 2)
key extractions is /enormously/ helpful.

If this is sinking in now, reread the rest of my last reply that got
snipped here.

> Roughly speaking, if the constant factor on the O(N) term is 100 and the
> constant factor on the O(logN) term is 1, then it's still irrelevant.

As above, it's talking about O() that's actually irrelevant in this
specific case.

> My point actually is this: big-Oh measures are fine for comparing
> algorithms in general, but when it comes to optimizing concrete
> implementations, its value greatly diminishes: you still have to
> investigate the constant factors and all the details that the big-Oh
> notation abstracts away.

That's true, although the argument here wasn't actually abstracting
away anything.  You've been adding abstraction to an argument that
didn't have any ;-)

> From that point of view, it's irrelevant whether some part of the
> algorithm contributes an O(1) or an O(logN) factor: the decision where
> to optimize is almost entirely dominated by the constant factors.

While true in some cases, it's irrelevant to this specific case.
More, in practice a factor of O(log(N)) is almost always more
important "for real" than a factor of O(1) anyway -- theoretical
algorithms hiding gigantic constant factors in O() notion are very
rarely used in real life.  For example, the best-known algorithm for
finding the number of primes <= x has O(sqrt(x)) time complexity, but
AFAIK has /never/ been implemented because the constant factor is
believed to be gigantic indeed.  Theoretical CompSci is full of
results like that, but they have little bearing on practical
programming.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A Sort Optimization Technique: decorate-sort-dedecorate

2006-08-29 Thread Tim Peters
[Joachim Durchholz]
>>> Wikipedia says it's going from 2NlogN to N. If a sort is massively
>>> dominated by the comparison, that could give a speedup of up to 100%
>>> (approximately - dropping the logN factor is almost irrelevant, what
>>> counts is losing that factor of 2).

[Gabriel Genellina]
>> In fact it's the other way - losing a factor of 2 is irrelevant,
>> O(2N)=O(N). The logN factor is crucial here.

[Joachim Durchholz]
> That's just a question of what you're interested in.
>
> If it's asymptotic behavior, then the O(logN) factor is a difference.
>
> If it's practical speed, a constant factor of 2 is far more relevant
> than any O(logN) factor.

Nope.  Even if you're thinking of base 10 logarithms, log(N, 10) > 2
for every N > 100.  Base 2 logarithms are actually most appropriate
here, and log(N, 2) > 2 for every N > 4.  So even if the "2" made
sense here (it doesn't -- see next paragraph), the log(N) term
dominates it for even relatively tiny values of N.

Now it so happens that current versions of Python (and Perl) use merge
sorts, where the worst-case number of comparisons is roughly N*log(N,
2), not Wikipedia's 2*N*log(N, 2) (BTW, the Wikipedia article
neglected to specify the base of its logarithms, but it's clearly
intended to be 2).  So that factor of 2 doesn't even exist in current
reality:  the factor of log(N, 2) is the /whole/ worst-case story
here, and, e.g., is near 10 when N is near 1000.  A factor of 10 is
nothing to sneeze at.

OTOH, current versions of Python (and Perl) also happen to use
"adaptive" merge sorts, which can do as few as N-1 comparisons in the
best case (the input is already sorted, or reverse-sorted).  Now I'm
not sure why anyone would sort a list already known to be sorted, but
if you do, DSU is a waste of time.  In any other case, it probably
wins, and usually wins big, and solely by saving a factor of (up to)
log(N, 2) key computations.

> (I'm not on the list, so I won't see responses unless specifically CC'd.)

Done.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: naive misuse? (of PyThreadState_SetAsyncExc)

2006-08-28 Thread Tim Peters
[EMAIL PROTECTED]
>> The documentation for PyThreadState_SetAsyncExc says "To prevent naive
>> misuse, you must write your own C extension to call this". Anyone care
>> to list a few examples of such naive misuse?

[and again]
> No? I'll take that then as proof that it's impossible to misuse the
> function.

That's wise ;-)  Stopping a thread asynchronously is in /general/ a
dangerous thing to do, and for obvious reasons.  For example, perhaps
the victim thread is running in a library routine at the time the
asynch exception is raised, and getting forcibly ejected from the
normal control flow leaves a library-internal mutex locked forever.
Or perhaps a catch-all "finally:" clause in the library manages to
release the mutex, but leaves the internals in an inconsistent state.
Etc.  The same kinds of potential disasters accout for why Java
deprecated its versions of this gimmick:

http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html

That said, you can invoke PyThreadState_SetAsyncExc() from Python
using the `ctypes` module (which will be included in 2.5, and is
available as an extension module for earlier Pythons).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Misleading error message when opening a file (on Windows XP SP 2)

2006-08-28 Thread Tim Peters
[Claudio Grondi]
> Here an example of what I mean
> (Python 2.4.2, IDLE 1.1.2, Windows XP SP2, NTFS file system, 80 GByte
> large file):
>
>  >>> f = file('veryBigFile.dat','r')
>  >>> f = file('veryBigFile.dat','r+')
>
> Traceback (most recent call last):
>File "", line 1, in -toplevel-
>  f = file('veryBigFile.dat','r+')
> IOError: [Errno 2] No such file or directory: 'veryBigFile.dat'
>
> Is it a BUG or a FEATURE?

Assuming the file exists and isn't read-only, I bet it's a Windows
bug, and that if you open in binary mode ("r+b") instead I bet it goes
away (this wouldn't be the first large-file text-mode Windows bug).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: time.clock() going backwards??

2006-08-25 Thread Tim Peters
[Giovanni Bajo]
>> I experimented something very strange, a few days ago. I was debugging an
>> application at a customer's site, and the problem turned out to be that
>> time.clock() was going "backwards", that is it was sometimes
>> (randomically) returning a floating point value which was "less than" the
>> value returned by the previous invokation. The computer was a pretty fast
>> one (P4 3Ghz I think, running Windows XP), and this happened only between
>> very close invokations of time.clock().

[Terry Reedy]
> I seem to remember this being mentioned before on the list, perhaps by Tim
> Peters.  Perhaps he will chime in.

No real need ;-)  BIOS or HAL bug on a multiprocessor (or maybe even
hyperthreaded) box is by far the most likely cause (and the only cause
ever identified for sure by people who followed up).  Python's C code
slinging QueryPerformanceCounter is exceedingly simple, and isn't a
suspect because of that.  It's on the edge of vague possibility that
Microsoft's compiler generates non-monotonic code for converting
64-bit integer to double:  i.e., Python's C code obviously relies on
that if i and j are _int64 satisfying i >= j, then (double)i >=
(double)j.  If that type-conversion code /is/ monotonic (which is
almost certainly so), then the only ways the C code can fail are if
the HW counter overflows (in which case time.clock() would "jump" a
huge amount), or if the sequence of values returned by
QueryPerformanceCounter is itself non-monotonic at times (which is
consistent with known BIOS/HAL bugs).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: random writing access to a file in Python

2006-08-25 Thread Tim Peters
[Claudio Grondi]
> I have a 250 Gbyte file (occupies the whole hard drive space)

Then where is Python stored ;-)?

> and want to change only eight bytes in this file at a given offset of
> appr. 200 Gbyte (all other data in that file should remain unchanged).
>
> How can I do that in Python?

Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(appr. 200 Gbyte)
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets.  For example, Windows
using NTFS does.  Try it.  Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate.  Use as
recent a released version of Python as you can.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: pickling and endianess

2006-08-24 Thread Tim Peters
[Chandrashekhar kaushik]
> Can an object pickled and saved on a little-endian machine be unpickled
> on a big-endian machine ?

Yes.  The pickle format is platform-independent (native endianness
doesn't matter, and neither do the native sizes of C's various integer
types).

> Does python handle this in some manner  , by say having a flag to indicate
> the endianess of the machine on which the object was pickled ?  And then
> using this flag to decide how that data will be unpickled ?

Not a flag, no, the pickle format is the same on all platforms.  If
you need to know details, read pickle.py.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: _PyLong_FromByteArray

2006-08-12 Thread Tim Peters
[Dan Christensen]
> My student and I are writing a C extension that produces a large
> integer in binary which we'd like to convert to a python long.  The
> number of bits can be a lot more than 32 or even 64.  My student found
> the function _PyLong_FromByteArray in longobject.h which is exactly
> what we need, but the leading underscore makes me wary.  Is it safe to
> use this function?

Python uses it internally, so it better be ;-)

>  Will it continue to exist in future versions of python?

No guarantees, and that's why it has a leading underscore:  it's not
an officially supported, externally documented, part of the advertised
Python/C API.  It so happens that I added that function, because
Python needed some form of its functionality internally across
different C modules.  Making it an official part of the Python/C API
would have been a lot more work (which I didn't have time for), and
created an eternal new maintenance burden (which I'm not keen on
regardless ;-)).

In practice, few people touch this part of Python's implementation, so
I don't /expect/ it will go away, or even change, for years to come.
The biggest insecurity I can think of offhand is that someone may
launch a crusade to make some other byte-array <-> long interface
"official" based on a different way of representing negative integers.
 But even then I expect the current unofficial functions to remain,
since the 256's-complement representation remains necessary for the
`struct` module's "q" format, and for the `pickle` module's protocol=2
long serialization format.

> Or is there some other method we should use?

No.  That's why these functions were invented to begin with ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Inconsistency producing constant for float "infinity"

2006-08-12 Thread Tim Peters
[Tim Peters]
>...
>> It has a much better chance of working from .pyc in Python 2.5.
>> Michael Hudson put considerable effort into figuring out whether the
>> platform uses a recognizable IEEE double storage format, and, if so,
>> marshal and pickle take different paths that preserve infinities,
>> NaNs, and signed zeroes.

[Alex Martelli]
> Isn't marshal constrained to work across platforms (for a given Python
> release), and pickle also constrainted to work across releases (for a
> given protocol)?

Yes to both.

> I'm curious about how this still allows them to "take different
> paths" (yeah, I _could_ study the sources, but I'm lazy:-)...

Good questions.  Pickle first:  pickle, with protocol >= 1, has always
had a binary format for Python floats, which is identical to the
big-endian IEEE-754 double-precision storage format.  This is
independent of the native C double representation:  even on a non-IEEE
box (e.g, VAX or Cray), protocol >= 1 pickle does the best it can to
/encode/ native doubles in the big-endian 754 double storage /format/.
 This is explained in the docs for the "float8" opcode in
pickletools.py:

 The format is unique to Python, and shared with the struct
 module (format string '>d') "in theory" (the struct and cPickle
 implementations don't share the code -- they should).  It's
 strongly related to the IEEE-754 double format, and, in normal
 cases, is in fact identical to the big-endian 754 double format.
 On other boxes the dynamic range is limited to that of a 754
 double, and "add a half and chop" rounding is used to reduce
 the precision to 53 bits.  However, even on a 754 box,
 infinities, NaNs, and minus zero may not be handled correctly
 (may not survive roundtrip pickling intact).
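
On a 754 box you can see that shared format from Python directly (a quick
check, not part of the quoted docs):

import pickle, struct

x = 3.14159
raw = struct.pack('>d', x)       # big-endian IEEE-754 double bytes
p2 = pickle.dumps(x, 2)          # the BINFLOAT opcode 'G' carries the same 8 bytes
print raw in p2                  # True
print struct.unpack('>d', raw)[0] == x   # True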

The problem has been that C89 defines nothing about signed zeroes,
infinities, or NaNs, so even on a 754 box there was no consistency
across platforms in what C library routines like frexp() returned when
fed one of those things.  As a result, what Python's "best non-heroic
effort" code for constructing a 754 big-endian representation actually
did was a platform-dependent accident when fed a 754 special-case
value.  Likewise for trying to construct a native C double from a 754
representation of a 754 special-case value -- again, there was no
guessing what C library routines like ldexp() would return in those
cases.

Part of what Michael Hudson did for 2.5 is add code to guess whether
the native C double format /is/ the big-endian or little-endian 754
double-precision format.  If so, protocol >= 1 pickle in 2.5 uses much
simpler code to pack and unpack Python floats, simply copying from/to
native bytes verbatim (possibly reversing the byte order, depending on
platform endianness).  Python doesn't even try to guess whether a C
double is "normal", or an inf, NaN, or signed zero then, so can't
screw that up -- it just copies the bits blindly.

That's much better on IEEE-754 boxes, although I bet it still has
subtle problems.  For example, IIRC, 754 doesn't wholly define the
difference in storage formats for signaling NaNs versus quiet NaNs, so
I bet it's still theoretically possible to pickle a signaling NaN on
one 754 box and get back a quiet NaN (or vice versa) when unpickled on
a different 754 box.

Protocol 0 (formerly known as "text mode") pickles are still a crap
shoot for 754 special values, since there's still no consistency
across platforms in what the C string<->double routines produce or
accept for special values.

Now on to marshal.  Before 2.5, marshal only had a "text mode" storage
format for Python floats, much like protocol=0 pickle.  So, as for
pickle protocol 0, what marshal produced or reconstructed for a 754
special value was a platform-dependent accident.

Michael added a binary marshal format for Python floats in 2.5, which
uses the same code protocol >= 1 pickle uses for serializing and
unserializing Python floats (except that the marshal format is
little-endian instead of big-endian).  These all go thru
floatobject.c's _PyFloat_Pack8 and _PyFloat_Unpack8 now, and a quick
glance at those will show that they take different paths according to
whether Michael's native-format-guessing code decided that the native
format was ieee_big_endian_format, ieee_little_endian_format, or
unknown_format.  The long-winded pre-2.5 pack/unpack code is only used
in the unknown_format case now.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Inconsistency producing constant for float "infinity"

2006-08-11 Thread Tim Peters
[Peter Hansen]
>> I'm investigating a puzzling problem involving an attempt to
>> generate a constant containing an (IEEE 754) "infinity" value.  (I
>> understand that special float values are a "platform-dependent
>> accident" etc...)

[also Peter]
> ...
> My guess about marshal was correct.

Yup.

> The problem (value becoming 1.0) appears when running from .pyc
> files.  Immediately after the source code is changed, the code works,
> since it doesn't unmarshal the .pyc file but just works from the
> bytecode freshly compiled in memory.
>
> This demonstrates what would be the heart of the problem, which I guess
> means this is not surprising to almost anyone, but perhaps will be a
> wakeup call to anyone who might still be unaware and has code that
> relies on constants like 1e producing infinities:

It has a much better chance of working from .pyc in Python 2.5.
Michael Hudson put considerable effort into figuring out whether the
platform uses a recognizable IEEE double storage format, and, if so,
marshal and pickle take different paths that preserve infinities,
NaNs, and signed zeroes.

For Pythons earlier than that, it's a better idea to avoid trying to
express special values as literals.  That not only relies on platform
accidents about how marshal works, but also on platform accidents
about the C string->float library works.  For example, do instead:

inf = 1e300 * 1e300
nan = inf - inf

If you're /on/ an IEEE-754 platform, those will produce an infinity
and a NaN, from source or from .pyc.
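
Regardless of how your platform chooses to /display/ them, they behave as
expected on a 754 box:

>>> inf = 1e300 * 1e300
>>> nan = inf - inf
>>> inf > 1e308, inf == inf + 1
(True, True)
>>> nan == nan
False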

>  >>> import marshal
>  >>> marshal.dumps(1e666)
> 'f\x061.#INF'
>  >>> marshal.loads(marshal.dumps(1e666))
> 1.0

Illustrating the remarkable truth that the Microsoft C float->string
routines can't read what the MS string->float routines produce for
special values (try reading "1.#INF" from C code with a double format
and you'll also get 1.0 back).

Here in Python 2.5, on Windows (note that marshal grew a new binary
float format, primarily to support this):

>>> import marshal
>>> marshal.dumps(1e666)
'g\x00\x00\x00\x00\x00\x00\xf0\x7f'
>>> marshal.loads(_)
1.#INF
>>> float.__getformat__('double')   # also new in 2.5
'IEEE, little-endian'
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: datetime to timestamp

2006-08-11 Thread Tim Peters
[Simen Haugen]
>>> How can I convert a python datetime to a timestamp? It's easy to convert
>>> a timestamp to datetime (datetime.datetime.fromtimestamp(), but the
>>> other way around...?)

[John Machin]
>> Is the timetuple() method what you want?
>>
>> #>>> import datetime
>> #>>> n = datetime.datetime.now()
>> #>>> n
>> datetime.datetime(2006, 8, 11, 23, 32, 43, 109000)
>> #>>> n.timetuple()
>> (2006, 8, 11, 23, 32, 43, 4, 223, -1)

[also John]
> Aaaarrrggghhh no it's not what you want

Yes, it is ;-)

> -- looks like you have to do the arithmetic yourself, starting with
> toordinal()

It's just one step from the time tuple:

import time
time.mktime(some_datetime_object.timetuple())

The datetime module intended to be an island of relative sanity.
Because the range of dates "timestamps" can represent varies across
platforms (and even "the epoch" varies), datetime doesn't even try to
produce timestamps directly -- datetime is more of an alternative to
"seconds from the epoch" schemes.  Because datetime objects have
greater range and precision than timestamps, conversion is
problem-free in only one direction.  It's not a coincidence that
that's the only direction datetime supplies ;-)
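
Picking up John's example, the round trip looks like this (whole seconds
only -- the microseconds don't survive -- and the exact timestamp value
depends on your timezone):

import time, datetime

n = datetime.datetime(2006, 8, 11, 23, 32, 43, 109000)
ts = time.mktime(n.timetuple())              # the timestamp
print datetime.datetime.fromtimestamp(ts)    # 2006-08-11 23:32:43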
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: threading.Event usage causing intermitent exception

2006-08-08 Thread Tim Peters
[EMAIL PROTECTED]
> Admittedly this problem causes no actual functional issues aside from
> an occasional error message when the program exits.  The error is:
>
> Unhandled exception in thread started by
> Error in sys.excepthook:
> Original exception was:
>
> Yes all that info is blank.

That's typical when the interpreter has torn so much of itself down
that there's not enough left in the `sys` module even to print
exception info gracefully.  The easiest way to stop that is to stop
/trying/ to run Python code while the interpreter is tearing itself
down.

> The application is a console application that is waiting for some
> condition on the machine to happen.  However, I leave open the
> possibility to cancel by a single key press at which
> point the program terminates.  Suffice it to say, I cannot perform both
> checks without invoking threads as the key press gets "missed"
> sometimes.  Below is a simplification of the code
>
> canceled = False
> myEvent = threading.Event()
>
> def watchForCancel():
>     global canceled
>     # turn off terminal buffering and capture key presses here
>     canceled = True
>     myEvent.set()

Presumably this is some kind of poll-and-sleep loop?  If so, add a
check to get out of the loop if myEvent.isSet().  Or if not, make it
some kind of poll-and-sleep loop ;-)

> def watchForCondition():
>     # do a bunch of stuff checking the system
>     myEvent.set()

Ditto.

> cancelThread = threading.Thread(target = watchForCancel)
> cancelThread.setDaemon(True)  # so I can exit the program when I want to

And get rid of that.  The comment doesn't make sense to me, and
forcing a thread to be a daemon is exactly what /allows/ the thread to
keep running while the interpreter is tearing itself down.  That's why
"daemonism" isn't the default:  it's at best delicate.  I don't see a
real reason for wanting this here.

> cancelThread.start()
> conditionThread = threading.Thread(target = watchForCondition)
> conditionThread.setDaemon(True)

Ditto.

> conditionThread.start()
>
> myEvent.wait()
>
> if canceled:
>     sys.exit(2)
>
> # do more stuff if the condition returned instead of cancel and then
> I'm done
>
>
> I've left out most of the active code, just cuz I think it muddies the
> water.  Now about 9 out of 10 times this works just fine.  However,
> every once in a while I get the exceptions mentioned above, but only
> when I cancel out of the operation.  I think the conditionThread is in
> the process of shutting down and gets hosed up somehow and spits out an
> exception, but the interpreter no longer has access to the info since
> it is shutting down.

At this point it's likely that even sys.stdout and sys.stderr no
longer exist.  The "Unhandled exception" message is printed directly
to the C-level `stderr` instead.

> ...
> I suppose I could make the threads aware of each other, but that just
> seems stupid.  Any suggestions on how to eliminate this intermittent
> error?

Stop forcing them to be daemon threads.  The interpreter won't start
to tear itself down then before both threads terminate on their own.
To arrange for that, it's not necessary for the threads to become
aware of each other, but it is necessary for the threads to become
aware of another (shared) reason /for/ exiting.  The most natural way
to do that, given what you said above, is to make both threads aware
that their shared myEvent event may get set "externally", and to stop
when they find it has been set.
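
Concretely, each watcher might be shaped something like this (a sketch;
key_pressed() here is a made-up stand-in for your unbuffered-terminal
check):

import threading, time

canceled = False
myEvent = threading.Event()

def watchForCancel(key_pressed):
    global canceled
    while not myEvent.isSet():     # stop if the other watcher finished first
        if key_pressed():
            canceled = True
            myEvent.set()
            return
        time.sleep(0.1)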
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Initializing the number of slots in a dictionary

2006-08-06 Thread Tim Peters
...

[Jon Smirl]
> I know in advance how many items will be added to the dictionary. Most
> dictionary implementations I have previously worked with are more
> efficient if they know ahead of time how big to make their tables.

Richard Jones spent considerable time investigating whether
"pre-sizing" lists and dicts in CPython could help, at the "Need For
Speed" sprint earlier this year.   He didn't find a win worth getting;
e.g., read the section "List pre-allocation" at:

http://wiki.python.org/moin/NeedForSpeed/Failures

Trying it for dicts was also part of what he did, but I don't think
specifics about that were recorded on the Wiki.  I was at the sprint,
and hoots of triumph from Richard's direction were conspicuous by
absence during his dict time ;-)

> In this case I only need to do a presence test of the key, there is no
> actual data associated with the key. The table is used for detecting
> duplicate entries. Is there a more efficient to do this test that sticking
> an empty string into a dict? The keys are sha1.digest().

It's probably more common to set the value to None or 1, but it
doesn't really matter in reality.  In theory, using 1 or an empty
string relies on the implementation accident that CPython stores those
uniquely, while it's guaranteed that None is a singleton object.

BTW, is there a reason to use SHA instead of MD5?  I ask because the
latter is 4 bytes shorter, and you apparently have a /lot/ of keys.

If you're using a current version of Python, you can save some memory
by using a builtin set object instead.  The CPython implementation of
sets is very much like its implementation of dicts, but doesn't
consume any memory for the (non-existent) associated values.  You also
get a number of operations useful on sets (like intersection and
union).
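
For instance (a sketch, assuming the `sha` module is where those digests
come from):

import sha

seen = set()

def is_duplicate(data):
    # Membership-only storage:  just the 20-byte digests, no dummy values.
    key = sha.new(data).digest()
    if key in seen:
        return True
    seen.add(key)
    return False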

> ...
> Since I am rerunning the conversion over and over anything simple that speeds
> it up is helpful.
>
> I already have 4GB RAM, it would take around 30GB to get everything in memory.
>
> Dictionaries are not a big problem for me, but there are many in use and
> they have millions of items in them.

A peculiar suggestion:  if you don't need to release system resources
cleanly at the end of a run, try doing:

import os
os._exit(0)

at the end.  If you have dicts with millions of entries swapped to
disk, it /can/ consume major time just to page them all in again to
decrement the key & value refcounts if you let "clean shutdown" code
determine they've become trash.  Bailing ungracefully can skip all
that work (but also skips other work, like letting the platform C I/O
library close still-open files gracefully).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: random shuffles

2006-07-21 Thread Tim Peters
[Boris Borcic]
> x.sort(cmp = lambda x,y : cmp(random.random(),0.5))
>
> pick a random shuffle of x with uniform distribution ?

Say len(x) == N.  With Python's current sort, the conjecture is true
if and only if N <= 2.

> Intuitively, assuming list.sort() does a minimal number of comparisons to
> achieve the sort,

No idea what that could mean in a rigorous, algorithm-independent way.

> I'd say the answer is yes. But I don't feel quite comfortable
> with the intuition... can anyone think of a more solid argumentation ?

If a list is already sorted, or reverse-sorted, the minimum number of
comparisons needed to determine that is N-1, and Python's current sort
achieves that.  It first looks for the longest ascending or descending
run.  If comparison outcomes are random, it will decide "yes, already
sorted" with probability 1/2**(N-1), and likewise for reverse-sorted.
When N > 2, those are larger than the "correct" probabilities 1/(N!).
So, e.g., when N=3, the sort will leave the list alone 1/4th of the
time (and will reverse the list in-place another 1/4th).  That makes
the identity and reversal permutations much more likely than "they
should be", and the bias is worse as N increases.

Of course random.shuffle(x) is the intended way to effect a
permutation uniformly at random.
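
A Python 2-era sketch of the difference (the cmp= argument is gone in
Python 3; the trial count is arbitrary):  count how often each
permutation of [0, 1, 2] comes back from the random-cmp sort versus
random.shuffle.  The identity and reversal permutations show up much
more often than 1/6th of the time with the sort; shuffle stays close to
uniform.

import random

def biased(seq):
    y = list(seq)
    y.sort(cmp=lambda a, b: cmp(random.random(), 0.5))
    return tuple(y)

def fair(seq):
    y = list(seq)
    random.shuffle(y)
    return tuple(y)

def tally(f, trials=60000):
    counts = {}
    for _ in xrange(trials):
        p = f(range(3))
        counts[p] = counts.get(p, 0) + 1
    return counts

if __name__ == '__main__':
    print "biased:", sorted(tally(biased).items())
    print "fair:  ", sorted(tally(fair).items())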
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: hash() yields different results for different platforms

2006-07-12 Thread Tim Peters
[Grant Edwards]
>> ...
>> The low 32 bits match, so perhaps you should just use that
>> portion of the returned hash?
>>
>> >>> hex(12416037344)
>> '0x2E40DB1E0L'
>> >>> hex(-468864544 & 0xffffffffL)
>> '0xE40DB1E0L'
>>
>> >>> hex(12416037344 & 0xffffffffL)
>> '0xE40DB1E0L'
>> >>> hex(-468864544 & 0xffffffffL)
>> '0xE40DB1E0L'

[Qiangning Hong]
> Is this relationship (same low 32 bits) guaranteed?

No.  Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.

> Will it change in the future version?

That's possible, but not planned.  Note that the guts of string
hashing in CPython today is implemented via

    while (--len >= 0)
        x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).
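
If you do decide to rely on the (unguaranteed) low-32-bit match quoted
at the top, the masking itself is a one-liner -- a sketch, not a
recommendation:

def hash32(obj):
    # keep only the low 32 bits, as a non-negative value
    return hash(obj) & 0xFFFFFFFF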
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Valgrind memory-checker reports memory problems in Python

2006-07-04 Thread Tim Peters
[Nathan Bates]
> Are the Python developers running Python under Valgrind?

Please read Misc/README.valgrind (in your Python distribution).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to stop python...

2006-07-02 Thread Tim Peters
[bruce]
> perl has the concept of "die". does python have anything similar. how can a
> python app be stopped?
>
> the docs refer to a sys.stop.

Python docs?  Doubt it ;-)

> but i can't find anything else... am i missing something...

>>> import sys
>>> print sys.exit.__doc__
exit([status])

Exit the interpreter by raising SystemExit(status).
If the status is omitted or None, it defaults to zero (i.e., success).
If the status is numeric, it will be used as the system exit status.
If it is another kind of object, it will be printed and the system
exit status will be one (i.e., failure).



Of course there's nothing to stop you from catching SystemExit either
-- it's just another exception, but one that happens to shut down the
interpreter in the specified way if it isn't caught & handled.
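
A quick illustration of that (Python 2 syntax, not from the original
post):

import sys

try:
    sys.exit(3)
except SystemExit, e:
    print "caught SystemExit with exit code", e.code

print "still here -- the exception was handled, so no shutdown happened"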
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Questions on Threading

2006-06-26 Thread Tim Peters
[j.c.sackett]
> I'm using the threading module to accomplish some distributed processing on
> a project, and have a basic (I hope) question that I can't find an answer to
> elsewhere.
>
> I've noted that there's a lot of documentation saying that there is no
> external way to stop a thread,

True.

> and yet when creating a thread through the threading module I can see
> that it has a method called _Thread__stop, which does indeed stop it.

You're misreading the code then.  __stop() is used internally by a
thread T to record when T itself knows it's _about_ to stop.  Other
threads may be blocked in T.join() calls, and the purpose of T calling
T.__stop() is to notify such threads they can proceed.

> Is this in any way dangerous ( i.e. doesn't actually stop the thread,

That's right.

> or causes other thread timing issues down the road?)

If you call T.__stop(), T will keep running (exactly as if you hadn't
called it), but threads waiting on T.join() will think T has stopped.

> Or is there some other reason that there is officially no stop method?

Python never offered it primarily for the same reasons Java regretted
offering it (so much so that Java eventually deprecated it):

http://java.sun.com/j2se/1.3/docs/guide/misc/threadPrimitiveDeprecation.html
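
The usual cooperative alternative is to give the thread a flag it checks
on its own schedule -- a minimal sketch (not from the original post; the
work loop and sleep interval are made up):

import threading
import time

class Worker(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self._stop_requested = threading.Event()

    def run(self):
        while not self._stop_requested.isSet():
            # do a bounded chunk of work, then check the flag again
            time.sleep(0.1)

    def stop(self):
        # ask the thread to finish; it exits at its next check
        self._stop_requested.set()

w = Worker()
w.start()
w.stop()
w.join()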
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Queue.Queue.queue.clear() thread-safe?

2006-06-22 Thread Tim Peters
[Russell Warren]
>>> I'm guessing no, since it skips down through any Lock semantics,

[Tim Peters]
>> Good guess :-)  It's also unsafe because some internal conditions must
>> be notified whenever the queue becomes empty (else you risk deadlock).

[Fredrik Lundh]
> "also" ?  if it weren't for the other things, the clear() call itself
> would have been atomic enough, right ?

Define "other things" :-)  For example, Queue.get() relies on that
when self._empty() is false, it can do self._get() successfully.
That's 100% reliable now because the self.not_empty condition is in
the acquired state across both operations.  If some yahoo ignores the
underlying mutex and clears the queue after self._empty() returns true
but before pop() executes self._get(), the call to the latter will
raise an exception.  That kind of failure is what I considered to be
an instance of a problem due to the OP's "skips down through any Lock
semantics"; the failure to notify internal conditions that the queue
is empty is a different _kind_ of problem, hence my "also" above.

If you're just asking whether deque.clear() is "atomic enough" on its
own, define "enough" :-)  It has to decref each item in the deque, and
that can end up executing arbitrary Python code (due to __del__
methods or weakref callbacks), and that in turn can allow the GIL to
be released allowing all other threads to run, and any of that
Python-level code may even mutate the deque _while_ it's being
cleared.

The C code in deque_clear() looks threadsafe in the sense that it
won't blow up regardless -- and that's about the strongest that can
ever be said for a clear() method.  dict.clear() is actually a bit
stronger, acting as if a snapshot were taken of the dict's state at
the instant  dict.clear() is called.  Any mutations made to the dict
as a result of decref side effects while the original state is getting
cleared survive.  In contrast, if a new item is added to a deque d
(via decref side effect)  while d.clear() is executing, it won't
survive.  That's "atomic enough" for me, but it is kinda fuzzy.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is Queue.Queue.queue.clear() thread-safe?

2006-06-22 Thread Tim Peters
[Russell Warren]
> I'm guessing no, since it skips down through any Lock semantics,

Good guess :-)  It's also unsafe because some internal conditions must
be notified whenever the queue becomes empty (else you risk deadlock).

> but I'm wondering what the best way to clear a Queue is then.
>
> Esentially I want to do a "get all" and ignore what pops out, but I
> don't want to loop through a .get until empty because that could
> potentially end up racing another thread that is more-or-less blindly
> filling it asynchronously.
>
> Worst case I think I can just borrow the locking logic from Queue.get
> and clear the deque inside this logic, but would prefer to not have to
> write wrapper code that uses mechanisms inside the object that might
> change in the future.

There's simply no defined way to do this now.

> Also - I can of course come up with some surrounding architecture to
> avoid this concern altogether, but a thread-safe Queue clear would do
> the trick and be a nice and short path to solution.
>
> If QueueInstance.queue.clear() isn't thread safe... what would be the
> best way to do it?  Also, if not, why is the queue deque not called
> _queue to warn us away from it?

"Consenting adults" -- if you want an operation that isn't supplied
out of the box, and are willing to take the risk of breaking into the
internals, Python isn't going to stop you from doing whatever you
like.  "mutex" isn't named "_mutex" for the same reason.  A

q.mutex.acquire()
try:
   q.queue.clear()
   q.unfinished_tasks = 0
   q.not_full.notify()
   q.all_tasks_done.notifyAll()
finally:
   q.mutex.release()

sequence probably works (caveat emptor).  BTW, it may be _possible_
you could use the newer task_done()/join() Queue gimmicks in some way
to get what you're after (I'm not really clear on that, exactly).

Anyway, yes, there is cross-release risk in breaking into the
internals like that.  That's an "adult decision" for you to make.  The
only way to get permanent relief is to re-think your approach, or
propose adding a new Queue.clear() method and Queue._clear() default
implementation.  I don't know whether that would be accepted -- it
seems simple enough, although since you're the first person to ask for
it in 15 years :-), you won't get much traction arguing that's a critical
lack, and every new bit of cruft becomes an ongoing burden too.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-21 Thread Tim Peters
[MTD <[EMAIL PROTECTED]>]
> I've been testing my recursive function against your iterative
> function, and yours is generally a quite steady 50% faster on
> factorizing 2**n +/- 1 for  0 < n < 60.

If you're still not skipping multiples of 3, that should account for most of it.

> I think that, for a challenge, I'll try to make a recursive function that
> matche or beats the iterative function -- it's worth the experiment!

Don't stop on that one before you succeed ;-)  Provided you make no
more than one recursive call per factor, you can't make more than
log(n, 2) recursive calls in total, and _typically_ far fewer than
that.  IOW, the number of recursive calls should generally be trivial,
and factoring 2**n will be the worst case.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: random.jumpahead: How to jump ahead exactly N steps?

2006-06-21 Thread Tim Peters
[Matthew Wilson]
> The random.jumpahead documentation says this:
>
> Changed in version 2.3: Instead of jumping to a specific state, n steps
> ahead, jumpahead(n) jumps to another state likely to be separated by
> many steps..
>
> I really want a way to get to the Nth value in a random series started
> with a particular seed.  Is there any way to quickly do what jumpahead
> apparently used to do?

No known way, and it seems unlikely that any quick way will be found
in my lifetime ;-) for the Mersenne Twister.  In Pythons <= 2.2, the
underlying generator was the algebraically _very_ much simpler
original Wichmann-Hill generator, and jumping ahead by exactly n steps
was just a matter of a few modular integer exponentiations.  That took
time proportional to log(n), so was extremely fast.

It was also much more _necessary_ using W-H, since W-H's period was
only around 10**13, while the Twister's period is around 10**6001:
even if you're going to use a billion random numbers per sequence, the
Twister's period has a way-way-beyond astronomical number of
non-overlapping subsequences of that length.  The chance of hitting an
overlap is correspondingly minuscule.

> I devised this function, but I suspect it runs really slowly:

Well, it takes exactly as much time as it takes to call random() n times.

> def trudgeforward(n):
>     '''Advance the random generator's state by n calls.'''
>     for _ in xrange(n): random.random()
>
> So any speed tips would be very appreciated.

What are you trying to achieve in the end?  Take it as a fact that
there is no known way to advance the Twister by n states faster than
what you've already found.
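
If the underlying goal is just reproducible, independent streams, a
common workaround (an assumption about your goal, not something
jumpahead gives you) is to hand each consumer its own seeded generator:

import random

# one generator per consumer, each independently reproducible from its seed
streams = [random.Random(seed) for seed in (1, 2, 3)]
values = [r.random() for r in streams]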
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-21 Thread Tim Peters
[Kay Schluehr]
> You might use a separate prime generator to produce prime factors. The
> factorize algorithm becomes quite simple and configurable by prime
> generators.

Alas, yours was _so_ simple that it always takes time proportional to
the largest prime factor of n (which may be n) instead of worst-case
time proportional to sqrt(n).  Repair that, and it's not quite so
simple anymore.  The speed of the sieve generator would also benefit a
lot by never looking at even integers, beyond an initial "yield 2" ...
the OP said he was looking for speed more than elegance, and they're
not good buddies here ;-)

> def eratosthenes():
>     memo = {}
>     q = 2
>     while True:
>         p = memo.pop(q, None)
>         if p is None:
>             yield q
>             memo[q*q] = q
>         else:
>             x = p + q
>             while x in memo:
>                 x += p
>             memo[x] = p
>         q+=1
>
> def factorize(n, sieve = eratosthenes):
>     if n <= 1:
>         return [n]
>     factors = []
>     primes = sieve()
>     for q in primes:
>         while n % q == 0:
>             factors.append(q)
>             n //= q
>         if n == 1:
>             return factors

At _some_ point you might think that's bound to be faster than only
skipping multiples of 2 and 3, because it does fewer divisions.  But
to get to a particular prime p, the sieve there has to go around its
outer loop about 3x as often as the trial() function goes around its
inner loop, and grows a dict with about p/ln(p) entries (while the
trial() function's memory use is constant), and those aren't free.

Applied to 991**2 (constructed so that both functions take time
proportional to sqrt(n)), on my box the factorize() function took 4x
longer than trial() (approximately 16 seconds versus 4).  Trial
division isn't practical for such large inputs, but I picked the
square of a "large" prime to give factorize() maximum advantage in
skipping division attempts (by the time we get to a prime p,
factorize() tries about p/ln(p) divisions, while trial() does about
p/3; so trial() does about ln(p)/3 times as many divisions as
factorize():  the larger the p we need, the larger that _factor_
gets).

Alas, it appears that made the dict so big that cache faults on dict
accesses more than wiped out the division advantage.  At the smaller
83**2, factorize() took only about 3.6x longer, despite losing
some of its relative division advantage.

In short, it's never what you think it is ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python SHA-1 as a method for unique file identification ? [help!]

2006-06-21 Thread Tim Peters
[EP <[EMAIL PROTECTED]>]
> This inquiry may either turn out to be about the suitability of the
> SHA-1 (160 bit digest) for file identification, the sha function in
> Python ... or about some error in my script

It's your script.  Always open binary files in binary mode.  It's a
disaster on Windows if you don't (if you open a file in text mode on
Windows, the OS pretends that EOF occurs at the first instance of byte
chr(26) -- this is an ancient Windows behavior that made an odd kind
of sense in the mists of history, and has persisted in worship of
Backward Compatibility despite that the original reason for it went
away _long_ ago).

...

> I am trying to reduce duplicate files in storage at home - I have a
> large number files (e.g. MP3s) which have been stored on disk multiple
> times under different names or on different paths.

...

> All seemed to be working until I examined my log files and found files
> with the same SHA digest had different sizes according to
> os.stat(fpath).st_size .  This is on Windows XP.
>
> -  Am I expecting too much of SHA-1?

No.  SHA-1 should work well for this.

> -  Is it that the os.stat data on Windows cannot be trusted?

It can be trusted to the extent that anything on Windows can be trusted ;-)

...
> def hashit(pth):
>     """Takes a file path and returns a SHA hash of its string"""
>     fs=open(pth,'r').read()

Make that 'rb' instead of 'r', and your problem will go away.
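
A repaired version of the quoted function might look like this (a
sketch; hashlib replaces the old sha module in Python >= 2.5):

import hashlib

def hashit(pth):
    """Takes a file path and returns a SHA-1 hex digest of its bytes."""
    f = open(pth, 'rb')         # binary mode -- the crucial change
    try:
        return hashlib.sha1(f.read()).hexdigest()
    finally:
        f.close()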

Do note that there are faster ways to go about this.  For example, if
a particular file has a unique size among the files you're looking at,
there's no reason to even open that file (let alone compute its hash).
 Group the files by size, and within a size group you can find
duplicates by, e.g., first comparing their initial 16 bytes, then the
next 32, then the next 64 ... etc.  If duplicates are uncommon, this
can be a huge savings.  For example, here's output from an old
dup-finding Python program of mine run over a directory tree which
happens to contain no duplicates:

Files
-----
Total                10,718
Unique               10,718
Duplicate                 0
# w/ unique size     10,053
# w/ unique prefix      665

Bytes
-----
Total         1,401,668,015
Unique        1,401,668,015
Duplicate                 0
Read                 76,688
Excess               76,688

The last two lines mean it actually read a total of only about 77,000
file bytes on its way to proving there were no duplicates among about
11,000 files spanning about 1.4 gigabytes.

No hashes are computed by that program.  I'd use a hash if instead I
were going to save a persistent (across runs) set of known hashes, so
that answering "here's a new file -- same as one I already have?"
could be done by computing its hash.  While the program above
generally reads "amazingly" few file bytes, it hammers the OS file
directory services, and it would go faster to compute a hash of a new
file than to run the whole analysis again.
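
A stripped-down sketch of that strategy (not the program whose output is
shown above -- it groups by size first, then uses a whole-file SHA-1
only as the tie-breaker, instead of the incremental prefix comparisons
described):

import os
import hashlib

def possible_dup_groups(paths):
    """Yield lists of paths with identical size and identical SHA-1."""
    by_size = {}
    for p in paths:
        by_size.setdefault(os.path.getsize(p), []).append(p)

    for size, group in by_size.iteritems():
        if len(group) < 2:
            continue            # unique size:  no need even to open the file
        by_hash = {}
        for p in group:
            f = open(p, 'rb')
            try:
                digest = hashlib.sha1(f.read()).hexdigest()
            finally:
                f.close()
            by_hash.setdefault(digest, []).append(p)
        for same_hash in by_hash.itervalues():
            if len(same_hash) > 1:
                yield same_hash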
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Iteration over recursion?

2006-06-20 Thread Tim Peters
[MTD]
> I've been messing about for fun creating a trial division factorizing
> function and I'm naturally interested in optimising it as much as
> possible.
>
> I've been told that iteration in python is generally more
> time-efficient than recursion. Is that true?

Since you heard it from me to begin with, I suppose saying "yes" again
won't really help ;-)  You can use time.time() or time.clock(), or the
`timeit` module, to measure speed.  Try timing a dead-simple factorial
functions both ways.

BTW, I've worked on Python's implementation for close to 15 years now,
so when I say something about Python, it's _usually_ safe to just
believe it :-)  Questioning is fine, but in cases like this where it's
easy to find out for yourself, I probably won't make time to respond.

> Here is my algorithm as it stands. Any suggestions appreciated!
>
> from math import *
>
> def factorize(x):
>     """ Return factors of x """
>     factors = []
>     lim = int(sqrt(x))

Here you're limited to integers for which a floating sqrt is accurate.
 That's 53 bits on most platforms.  For integers larger than that, the
code may produce incorrect results in mysterious ways at times
(because the square root may be inaccurate).

>     y = divmod(x,2)

Most people would use "unpacking assignment" here instead, like

q, r = divmod(x, 2)

Then later code gets to use meaningful names instead of messing with
mysterious little integer subscripts.  It's also quicker to use local
names than to build subscripting expressions.

>     if y[1] == 0: # x is even
>         factors = factors + [2] + factorize(y[0])

Since `factors` upon entry to this block must be [], it's kinda
contorted.  More obvious (if you _have_ to use recursion) would be:

return [2] + factorize(y[0])

and then unindent the rest of the code a level.

>     else:   # x is odd
>         i = 3
>         while i <= lim:
>             y = divmod(x,i)

Again more commonly written:

q, r = divmod(x, i)

>             if y[1] == 0:
>                 factors = factors + [i] + factorize(y[0])

It hardly matters in this context, but appending to a list is
_generally_ much faster than building a brand new list from scratch.

At a higher level, your recursions always start from 2 again; e.g., if
i happens to be 434349 here, the factorize(y[0]) call starts over from
2, despite that you already know that no divisor less than 434349 can
succeed.  To do this _efficiently_ with recursion requires passing
more knowledge across recursion levels.

>                 i = lim+1
>             else:
>                 i += 2
>
>     if factors == []: # necessary step for correct recursion
>         factors = [x]
>
>     return factors

Recursion is pretty bizarre here.  Not _wrong_, just "pretty bizarre",
and is leading you into higher-level inefficiences.  There are all
sorts of ways to repair that, but I'm not going to talk about them
because it's easy to write an iterative algorithm instead.

Here's one way to do that, avoiding float sqrt (this works correctly
for integers of any size, although trial division is hopelessly
_inefficient_ for finding "large" factors), and skipping multiples of
3 in the inner loop too:

def trial(n):
    if n <= 1:
        return [n]
    factors = []
    for tiny in 2, 3:
        while n % tiny == 0:
            factors.append(tiny)
            n //= tiny
    delta = 2
    d = 5
    while 1:
        q, r = divmod(n, d)
        if r:
            if q < d:
                break
            d += delta
            delta = 6 - delta
        else:
            factors.append(d)
            n = q
    if n > 1:
        factors.append(n)
    return factors

If we call the original input N, the key invariants holding at the top
of the loop are:

1. N >= 2, and N == n * the product of the integers in `factors`
2. n is not divisible by any prime < d
3. all elements in `factors` are prime

It may take a bit of thought to see that d marches over all integers
>= 5 that aren't divisible by either 2 or 3:  5, 7, 11, 13, 17, 19,
23, 25, 29, 31, ...

It's more-or-less generally true that establishing and proving loop
invariants in an iterative algorithm is "the same thing" as
establishing and proving pre- and post-conditions in a recursive
algorithm.  While it may feel strained at first, it's very much
worthwhile learning how to do that.  For example, from #2 it's easy to
prove that if q < d, n must either be 1 or a prime.  That's what
justifies the "break" statement inside the loop.  Then since that's
the only way to get out of the loop, n is either 1 or a prime when the
loop finishes, and that explains the rest of the code.
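
For instance (a worked example, not in the original post):

>>> trial(600851475143)
[71, 839, 1471, 6857]
>>> trial(97)
[97]
>>> trial(1)
[1]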
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need helping tracking down weird bug in cPickle

2006-06-20 Thread Tim Peters
[Carl J. Van Arsdall]
> Hey everyone, cPickle is raising an ImportError that I just don't quite
> understand.

When that happens, the overwhelmingly most likely cause is that the
set of modules on your PYTHONPATH has changed since the pickle was
first created, in ways such that a module _referenced_ inside the
pickle can no longer be found.

For example, here's file resourcepmdb.py on my box:

def f():
    print "hi"

Here's a bit of Python code that uses it:

import cPickle

import resourcepmdb

pf = open("junk.pik", "wb")
pickler = cPickle.Pickler(pf)
pickler.dump([resourcepmdb.f])
pf.close()

Pickle stores module globals like user-defined classes and module
functions (the case above) not by saving the code they contain, but
_just_ by saving strings recording the module's name and the global's
name.  The unpickling process creates an import out of those strings.

So long as resourcepmdb stays on my PYTHONPATH, unpickling junk.pik
works fine.  For example, this code:

import cPickle

pf = open("junk.pik", "rb")
pickler = cPickle.Unpickler(pf)
whatever = pickler.load()
pf.close()
print whatever

prints something like

[<function f at 0x...>]

But if resourcepmdb gets off my path (for example, resourcepmdb.py
gets deleted, renamed, or moved -- or PYTHONPATH itself changes), that
same code dies with exactly the error you reported:

Traceback (most recent call last):
  File "pik2.py", line 5, in ?
whatever = pickler.load()
ImportError: No module named resourcepmdb

> ...
> I've done some research and it looks like cPickle tries to load some
> modules as some kind of test.

That's unlikely ;-)

> From what I can tell the module that cPickle is trying to load is a
> concatenation of the module that it exists in and part of the string
> that was sent in the orignal request.

Sorry, can't tell from here.

> It looks like something where memory is getting overwritten, but I'm not
> really doing anything too fancy, so i'm not sure how to approach this
> problem.  I'm using python2.2.  Does anyone know how I might go about
> trying to troubleshoot something like this?  Is there a good tool or a
> better method of error handling I can use to try and track this down?
> Or is it better to give up troubleshooting and just upgrade to 2.4?

You could at least _install_ a recent Python so you could use the
newer pickletools module.  pickletools.dis() displays a readable
"disassembly" of a pickle, so that there's no need to guess about what
the pickle _thinks_ it's trying to do.  pickletools won't have any
problem working with pickles created by an older Python.
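
For example, something along these lines (a sketch, reusing the junk.pik
name from above) shows exactly which module and global names the pickle
will try to import:

import pickletools

pf = open("junk.pik", "rb")
try:
    pickletools.dis(pf.read())   # look for the GLOBAL opcode(s) in the output
finally:
    pf.close()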

> ...
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "/home/build/bin/resourceManager/resourceQueue.py", line 50, in
> initQueue
> pickledList = cPickle.load(queueFPtr)
> ImportError: No module named resourcepmdb'

See above for the most obvious way to get an error like that.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: PyObject_SetItem(..) *always* requires a Py_INCREF or not?

2006-06-17 Thread Tim Peters
[EMAIL PROTECTED]
> I would think everytime you add an item to a list you must increase
> reference count of that item.

_Someone_ needs to.  When the function called to add the item does the
incref itself, then it would be wrong for the caller to also incref
the item.

> http://docs.python.org/api/refcountDetails.html has an example that
> seems to contradict that
>
> int
> set_all(PyObject *target, PyObject *item)
> {
>     int i, n;
>
>     n = PyObject_Length(target);
>     if (n < 0)
>         return -1;
>     for (i = 0; i < n; i++) {
>         if (PyObject_SetItem(target, i, item) < 0)
>             return -1;
>     }
>     return 0;
> }
>
> *WHY* don't you need a Py_INCREF(item); in the for loop!?!?!?

You should take a break, and read that section again later ;-)

The _point_ of that example is in fact to illustrate that you don't
need to incref when calling PyObject_SetItem().  While I can't know,
I'm guessing that you're still "seeing" PyList_SetItem() here, which
has radically different behavior.  PyList_SetItem() steals a reference
to its second argument, but PyObject_SetItem() does not.  Read the
whole section again from its start, and this should be much clearer
the second time through.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Numerics, NaNs, IEEE 754 and C99

2006-06-15 Thread Tim Peters
[Nick Maclaren]
>>>> Firstly, a FAR more common assumption is that integers wrap in twos'
>>>> complement - Python does not do that.

[Grant Edwards]
>>> It used to

[Fredrik Lundh]
>> for integers ?  what version was that ?

[Grant]
> Am I remembering incorrectly?

Mostly but not entirely.

> Didn't the old fixed-width integers operate modulo-wordsize?

Not in Python.

>  I distinctly remember having to rewrite a bunch of checksum code when
> fixed-width integers went away.

Best guess is that you're working on a 32-bit box, and remember this
Python <= 2.2 behavior specific to the left shift operator:

>>> 1 << 31
-2147483648
>>> 1 << 32
0

On a 64-bit box (except Win64, which didn't exist at the time ;-)),
those returned 2**31 and 2**32 instead, while "1 << 64" wrapped to 0
(etc).

Python 2.3 started warning about that:

>>> 1 << 31
__main__:1: FutureWarning: x<<y losing bits or changing sign will return a long in Python 2.4 and up
-2147483648

and in 2.4 the result became a long:

>>> 1 << 31
2147483648L
>>> 1 << 32
4294967296L

+ - * / on short ints always complained about overflow before int-long
unification, although IIRC very early Pythons missed the overflow
error in (on a 32-bit box) int(-(2L**31))/-1.
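
If old checksum-style code really wants the modulo-2**32 behavior back,
explicit masking gives it in any Python version -- a sketch (not from
the original exchange):

MASK = 0xFFFFFFFF

def add32(x, y):
    # 32-bit wraparound addition, no matter how big Python ints can grow
    return (x + y) & MASK

def lshift32(x, n):
    # the old "1 << 32 == 0" behavior, made explicit
    return (x << n) & MASK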
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: math.pow(x,y)

2006-06-11 Thread Tim Peters
[Wojciech Muła]
>> You have to use operator **, i.e. 34564323**456356

Or the builtin pow() instead of math.pow().

[Gary Herron]
> That's not very practical. That computation will produce a value with
> more than 3.4 million digits.

Yes.

> (That is, log10(34564323)*456356 = 3440298.) Python will attempt this, but
> I was not patient enough to see if it could calculate an answer today (or even
> this week).

On my box it took less than 30 seconds to do

x = 34564323**456356

If you try to _display_ that as a decimal string, it will take
enormously longer.  Python uses a power-of-2 base internally, and
conversion to decimal takes time quadratic in the number of digits.
Doing y = hex(x) instead is very fast (small fraction of a second).

> I doubt that you really *want* all 3.4 million digits. So what is it you
> really want? A scientific or engineering result as a floating point
> number accurate to some reasonable number of digits? That integer value
> modulo some other integer (as used in various cryptology schemes)?

For example, if you only want the last 6 digits, pow(34564323, 456356,
10**6) returns 986961 in an eyeblink.
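
And if a floating-point approximation is all that's wanted, logarithms
give the leading digits and the exponent without ever building the
3.4-million-digit integer -- a sketch (the accuracy claim in the comment
is a rough estimate):

import math

e = 456356 * math.log10(34564323)   # log10 of the full result
exponent = int(e)
mantissa = 10 ** (e - exponent)     # good to roughly 8 significant digits here
print "approximately %.6fe+%d" % (mantissa, exponent)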
-- 
http://mail.python.org/mailman/listinfo/python-list


  1   2   3   4   >