Re: The reliability of python threads

2007-02-01 Thread Steve Holden
Carl J. Van Arsdall wrote:
 Steve Holden wrote:
 [snip]

 Are you using memory with built-in error detection and correction?

   
 You mean in the hardware?  I'm not really sure, I'd assume so but is 
 there any way I can check on this?  If the hardware isn't doing that, is 
 there anything I can do with my software to offer more stability?
 
You might be able to check using the OS features (have you said what OS 
you are using?) - alternatively Google for information from the system 
supplier.

If you don't have that feature in hardware you are up sh*t creek without 
a paddle, as it can't be emulated.

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note:  http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: The reliability of python threads

2007-01-30 Thread John Nagle
Aahz wrote:
 In article [EMAIL PROTECTED],
 Carl J. Van Arsdall [EMAIL PROTECTED] wrote:
 My point is that an app that dies only once every few months under load
 is actually pretty damn stable!  That is not the kind of problem that
 you are likely to stimulate.

 This has all been so vague.  How does it die?

 It would be useful if Python detected obvious deadlock.  If all threads
are blocked on mutexes, you're stuck, and at that point, it's time
to abort and do tracebacks on all threads.   You shouldn't have to
run under a debugger to detect that.

 Then a timer, so that if the Global Interpreter Lock (GIL)
stays locked for more than N seconds, you get an abort and a traceback.
That way, if you get stuck in some C library, it gets noticed.

 Those would be some good basic facilities to have in thread support.
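
Editor's note for readers on current CPython: the standard-library faulthandler module (added in Python 3.3, long after this thread) covers part of the timer idea. It does not watch the interpreter lock itself. It is a plain wall-clock timer that dumps the traceback of every thread unless it is cancelled in time. A minimal sketch, in which guarded_call and the 30-second default are illustrative assumptions and not anything proposed in the thread:

```python
import faulthandler
import sys


def guarded_call(func, *args, timeout=30.0, log=sys.stderr, abort=False):
    """Run func(*args); if it has not returned after `timeout` seconds,
    dump the tracebacks of all threads to `log`.

    Hypothetical helper. With abort=True the process exits after the dump.
    `log` must be a real file (it needs a file descriptor).
    """
    # faulthandler arms a watchdog thread implemented in C, so the dump
    # does not depend on the stuck Python code ever yielding control.
    faulthandler.dump_traceback_later(timeout, file=log, exit=abort)
    try:
        return func(*args)
    finally:
        # Getting here means the call finished in time: disarm the timer.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping only the calls that can block (a C extension, a network read) keeps the timeout meaningful. Detecting "all threads blocked on mutexes" is a different problem and is not attempted here.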

 In real-time work, you usually have a high-priority thread which
wakes up periodically and checks that a few flags have been set
indicating progress of the real time work, then clears the flags.
Throughout the real time code, flags are set indicating progress
for the checking thread to notice.  All serious real time systems
have some form of stall timer like that; there's often a stall
timer in hardware.
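
Editor's note: the flag-checking scheme described above can be sketched with ordinary Python threads. Python offers no thread priorities, so the checker here is just another thread, and the class name, the period, and the on_stall callback are all assumptions made for illustration.

```python
import threading
import time


class StallWatchdog:
    """Checker thread that expects every registered flag to be set
    at least once per `period` seconds (all names are illustrative)."""

    def __init__(self, names, period, on_stall):
        self._flags = {name: threading.Event() for name in names}
        self._period = period
        self._on_stall = on_stall
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def progress(self, name):
        # Called from the working code to report that it is still moving.
        self._flags[name].set()

    def _run(self):
        # wait() returns False on timeout and True once stop() was called.
        while not self._stop.wait(self._period):
            stalled = [n for n, f in self._flags.items() if not f.is_set()]
            for flag in self._flags.values():
                flag.clear()
            if stalled:
                self._on_stall(stalled)
```

What on_stall should do (log, dump tracebacks, exit so a supervisor restarts the process) is a policy decision the sketch leaves open.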

John Nagle


Re: The reliability of python threads

2007-01-30 Thread Carl J. Van Arsdall
Steve Holden wrote:
 [snip]

 Are you using memory with built-in error detection and correction?

   
You mean in the hardware?  I'm not really sure, I'd assume so but is 
there any way I can check on this?  If the hardware isn't doing that, is 
there anything I can do with my software to offer more stability?





-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-30 Thread Carl J. Van Arsdall
John Nagle wrote:
 Aahz wrote:
   
 In article [EMAIL PROTECTED],
 Carl J. Van Arsdall [EMAIL PROTECTED] wrote:
 My point is that an app that dies only once every few months under load
 is actually pretty damn stable!  That is not the kind of problem that
 you are likely to stimulate.
 

  This has all been so vague.  How does it die?
   
Well, before operating on most of the data I perform type checks; if a 
type check fails, my system flags an exception.  Now I'm in the process 
of finding out how the data went bad.  I gotta wait at this point 
though, so I was investigating possibilities so I could find a new way 
of throwing the kitchen sink at it.


  It would be useful if Python detected obvious deadlock.  If all threads
 are blocked on mutexes, you're stuck, and at that point, it's time
 to abort and do tracebacks on all threads.   You shouldn't have to
 run under a debugger to detect that.

  Then a timer, so that if the Global Interpreter Lock (GIL)
 stays locked for more than N seconds, you get an abort and a traceback.
 That way, if you get stuck in some C library, it gets noticed.

  Those would be some good basic facilities to have in thread support.
   
I agree.  That would be incredibly useful.  Although doesn't this spark 
up the debate on threads killing threads?  From what I understand, this 
is frowned upon (and Thread.stop() was deprecated in Java because it was 
dangerous).  Although I think that if there was a master or control 
thread that watched the state of the system and could intervene, that 
would be powerful.  One way to do this could be to use processes, and 
each process could catch a kill signal if it appears to be stalled, 
although I am absolutely sure there is more to it than that.  I don't 
think this could be done at all with Python threads though, but as a fan 
of Python threads and their ease of use, it would be a nice and powerful 
feature to have.
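
Editor's note: the process-based variant is easy to sketch because, unlike a thread, a child process can be killed from outside. The helper below uses the crudest possible definition of "stalled", a fixed deadline. The name run_with_deadline and the return convention are assumptions for illustration, not something from the thread.

```python
import subprocess
import sys


def run_with_deadline(argv, deadline):
    """Run `argv` as a child process; kill it if it runs past `deadline` seconds.

    Returns (returncode, killed). Hypothetical helper for illustration.
    """
    child = subprocess.Popen(argv)
    try:
        return child.wait(timeout=deadline), False
    except subprocess.TimeoutExpired:
        child.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
        return child.wait(), True
```

For example, `run_with_deadline([sys.executable, "worker.py"], 600)` (worker.py being a placeholder) gives the worker ten minutes. A real supervisor would more likely watch a heartbeat than a fixed deadline, and would have to decide what to do with half-finished work.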


-carl


-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-29 Thread Carl J. Van Arsdall
Hendrik van Rooyen wrote:
 [snip]
 could definitely do more of them.  The thing will be 
 

 When I read this - I thought - probably your stuff is working 
 perfectly - on your test cases - you could try to send it some
 random data and to see what happens - seeing as you have a test 
 server, throw the kitchen sink at it.

 Possibly random here means something that looks like data
 but that is malformed in some way. Kind of try to trick the 
 system to get it to break reliably.

 I'm sorry I can't be more specific - it sounds so weak, and you
 probably already have test cases that must fail but I don't 
 know how to put it any better...
   
Well, sometimes a weak analogy is the best thing because it allows me to 
fill in the blanks: "How can I throw a kitchen sink at it in a way I 
never have before?"

And away my mind goes, so thank you.

-carl

-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-29 Thread Aahz
In article [EMAIL PROTECTED],
Carl J. Van Arsdall [EMAIL PROTECTED] wrote:
Aahz wrote:

 My response is that you're asking the wrong questions here.  Our database
 server locked up hard Sunday morning, and we still have no idea why (the
 machine itself, not just the database app).  I think it's more important
 to focus on whether you have done all that is reasonable to make your
 application reliable -- and then put your efforts into making your app
 recoverable.
   
Well, I assume that I have done all I can to make it reliable.  This 
list is usually my last resort, or a place where I come hoping to find 
ideas that aren't coming to me naturally.  The only other thing I 
thought to come up with was that there might be network errors.  But 
i've gone back and forth on that, because TCP should handle that for me 
and I shouldn't have to deal with it directly in pyro, although I've 
added (and continue to add) checks in places that appear appropriate 
(and in some cases, checks because I prefer to be paranoid about errors).

My point is that an app that dies only once every few months under load
is actually pretty damn stable!  That is not the kind of problem that
you are likely to stimulate.

 I'm particularly making this comment in the context of your later point
 about the bug showing up only every three or four months.

 Side note: without knowing what error messages you're getting, there's
 not much anybody can say about your programs or the reliability of
 threads for your application.
   
Right, I wasn't coming here to get someone to debug my app, I'm just 
looking for ideas.  I constantly am trying to find new ways to improve 
my software and new ways to reduce bugs, and when i get really stuck, 
new ways to track bugs down.  The exception won't mean much, but I can 
say that the error appears to me as bad data.  I do checks prior to 
performing actions on any data, if the data doesn't look like what it 
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad.  In 
tracking down the problem a couple guys mentioned that problems like 
that usually are a race condition.  From here I examined my code, 
checked out all the locking stuff, made sure it was good, and wasn't 
able to find anything.  Being that there's one lock and the critical 
sections are well defined, I'm having difficulty.  One idea I have to 
try and get a better understanding might be to check data before its 
stored.  Again, I still don't know how it would get messed up nor can I 
reproduce the error on my own. 

Do any of you think that would be a good practice for trying to track 
this down? (Check the data after reading it, check the data before 
saving it)

What we do at my company is maintain log files.  When we think we have
identified a potential choke point for problems, we add a log call.
Tracking this down will involve logging the changes to your data until
you can figure out where it goes wrong -- once you know where it goes
wrong, you have an excellent chance of figuring out why.
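
Editor's note: a minimal sketch of that kind of log call using the standard logging module, with the thread name in every line so changes made by different threads can be told apart. The logger name "datatrace" and the set_field helper are hypothetical; the thread does not show the real application's data structures.

```python
import logging
import threading

log = logging.getLogger("datatrace")  # logger name is an arbitrary choice


def configure(stream=None):
    """Attach a handler that stamps each line with time and thread name."""
    handler = logging.StreamHandler(stream)  # stderr when stream is None
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(threadName)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)


def set_field(record, key, value):
    """Change record[key] and log old and new value (hypothetical choke point)."""
    old = record.get(key)
    record[key] = value
    log.debug("set %s: %r -> %r", key, old, value)
```

Routing every change through one function like set_field is the design choice that matters: it gives a single place to log, and later a single place to put a breakpoint or an assertion.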
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

I disrespectfully agree.  --SJM


Re: The reliability of python threads

2007-01-29 Thread Steve Holden
Carl J. Van Arsdall wrote:
 Aahz wrote:
 [snip]

 My response is that you're asking the wrong questions here.  Our database
 server locked up hard Sunday morning, and we still have no idea why (the
 machine itself, not just the database app).  I think it's more important
 to focus on whether you have done all that is reasonable to make your
 application reliable -- and then put your efforts into making your app
 recoverable.
   
 Well, I assume that I have done all I can to make it reliable.  This 
 list is usually my last resort, or a place where I come hoping to find 
 ideas that aren't coming to me naturally.  The only other thing I 
 thought to come up with was that there might be network errors.  But 
 i've gone back and forth on that, because TCP should handle that for me 
 and I shouldn't have to deal with it directly in pyro, although I've 
 added (and continue to add) checks in places that appear appropriate 
 (and in some cases, checks because I prefer to be paranoid about errors).
 
 
 I'm particularly making this comment in the context of your later point
 about the bug showing up only every three or four months.

 Side note: without knowing what error messages you're getting, there's
 not much anybody can say about your programs or the reliability of
 threads for your application.
   
 Right, I wasn't coming here to get someone to debug my app, I'm just 
 looking for ideas.  I constantly am trying to find new ways to improve 
 my software and new ways to reduce bugs, and when i get really stuck, 
 new ways to track bugs down.  The exception won't mean much, but I can 
 say that the error appears to me as bad data.  I do checks prior to 
 performing actions on any data, if the data doesn't look like what it 
 should look like, then the system flags an exception.
 
 The problem I'm having is determining how the data went bad.  In 
 tracking down the problem a couple guys mentioned that problems like 
 that usually are a race condition.  From here I examined my code, 
 checked out all the locking stuff, made sure it was good, and wasn't 
 able to find anything.  Being that there's one lock and the critical 
 sections are well defined, I'm having difficulty.  One idea I have to 
 try and get a better understanding might be to check data before its 
 stored.  Again, I still don't know how it would get messed up nor can I 
 reproduce the error on my own. 
 
 Do any of you think that would be a good practice for trying to track 
 this down? (Check the data after reading it, check the data before 
 saving it)
 
Are you using memory with built-in error detection and correction?

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note:  http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007



Re: The reliability of python threads

2007-01-27 Thread Hendrik van Rooyen
 Carl J. Van Arsdall [EMAIL PROTECTED] wrote:
 Hendrik van Rooyen wrote:
   Carl J. Van Arsdall [EMAIL PROTECTED] wrote:

8< ---


 Yea, I do some of that too.  I use that with conditional print 
 statements to stderr when i'm doing my validation against my test 
 cases.  But I could definitely do more of them.  The thing will be 

When I read this - I thought - probably your stuff is working 
perfectly - on your test cases - you could try to send it some
random data and to see what happens - seeing as you have a test 
server, throw the kitchen sink at it.

Possibly random here means something that looks like data
but that is malformed in some way. Kind of try to trick the 
system to get it to break reliably.

I'm sorry I can't be more specific - it sounds so weak, and you
probably already have test cases that must fail but I don't 
know how to put it any better...
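
Editor's note: "something that looks like data but is malformed" can be generated mechanically by damaging one field of a known-good record. The sketch below assumes records are dicts with string keys and that a clean rejection is signalled with ValueError. Both are assumptions, as are the names mutate and fuzz.

```python
import random


def mutate(record, rng):
    """Return a copy of `record` with exactly one field damaged."""
    damaged = dict(record)
    key = rng.choice(sorted(damaged))
    damage = rng.choice(["drop", "none", "retype", "truncate"])
    if damage == "drop":
        del damaged[key]
    elif damage == "none":
        damaged[key] = None
    elif damage == "retype":
        damaged[key] = [damaged[key]]  # right content, wrong type
    else:
        damaged[key] = str(damaged[key])[:1]  # first character only
    return damaged


def fuzz(handler, good_record, rounds=1000, seed=0):
    """Feed damaged records to `handler`; collect the ones that made it
    fail with anything other than a clean ValueError rejection."""
    rng = random.Random(seed)  # seeded, so a failing run can be replayed
    surprises = []
    for _ in range(rounds):
        bad = mutate(good_record, rng)
        try:
            handler(bad)
        except ValueError:
            continue  # rejected cleanly, which is what we want
        except Exception as exc:
            surprises.append((bad, exc))
    return surprises
```

This only exercises input handling in a single thread. It will not provoke a race by itself, but it does show which malformed inputs get past the existing checks.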

- Hendrik




Re: The reliability of python threads

2007-01-26 Thread Nick Maclaren

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] writes:
| 
| What makes you think Paddy indicated he wouldn't try to solve the problem?
| Here's what he wrote:
| 
| What I'm proposing is that if, for example, a process stops running
| three times in a year at roughly three to four month intervals, and it
| should have stayed up; then restart the server sooner, at a time of
| your choosing, whilst taking other measures to investigate the error.
| 
| I see nothing wrong with trying to minimize the chances of a problem rearing
| its ugly head while at the same time trying to investigate its cause (and
| presumably solve it).

No, nor do I, but look more closely.  His quote makes it quite clear that
he has got it firmly in his mind that this is a degradation problem, and
so regular restarting will improve the reliability.  Well, it could also
be one where failure becomes LESS likely the longer the server stays up
(i.e. the settling down problem).

No problem is as hard to find as one where you are firmly convinced that
it is somewhere other than where it is.


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-26 Thread Paddy


On 26 Jan, 09:05, [EMAIL PROTECTED] (Nick Maclaren) wrote:
 In article [EMAIL PROTECTED],[EMAIL PROTECTED] writes:|
 | What makes you think Paddy indicated he wouldn't try to solve the problem?
 | Here's what he wrote:
 |
 | What I'm proposing is that if, for example, a process stops running
 | three times in a year at roughly three to four month intervals, and it
 | should have stayed up; then restart the server sooner, at a time of
 | your choosing, whilst taking other measures to investigate the error.
 |
 | I see nothing wrong with trying to minimize the chances of a problem rearing
 | its ugly head while at the same time trying to investigate its cause (and
 | presumably solve it).

 No, nor do I, but look more closely.  His quote makes it quite clear that
 he has got it firmly in his mind that this is a degradation problem, and
 so regular restarting will improve the reliability.  Well, it could also
 be one where failure becomes LESS likely the longer the server stays up
 (i.e. the settling down problem).

 If in the past year the settling down problem did not rear its head
when the server crashed after three to four months and was restarted,
then why not implement a regular, notified downtime, whilst also
looking into the problem in more depth?

* You are already having to restart.
* Restarts last for 3-4 months.
Why burden yourself with "Oh, but it could fail once in three hours,
you've not proved that it can't, we'll have to stop everything whilst
we do a thorough investigation. Is it Poisson? Is it 'settling down'?
Just wait whilst I prepare my next doctoral thesis..."?

- Okay, the last was extreme, but cathartic :-)

 No problem is as hard to find as one where you are firmly convinced that
 it is somewhere other than where it is.
Amen!
 
 Regards,
 Nick Maclaren.
- Paddy.



Re: The reliability of python threads

2007-01-26 Thread Carl J. Van Arsdall
Hendrik van Rooyen wrote:
  Carl J. Van Arsdall [EMAIL PROTECTED] wrote:

   
 [snip]
 

 Are you 100% rock bottom gold plated guaranteed sure that there is
 not something else that is also critical that you just haven't realised is?
   
100%?  No, definitely not.  I know myself: as I explore this option and 
other options, I will of course be going into and out of the code, 
looking for that small piece I might have missed.  But I'm like a modern 
operating system, I do lots of things at once.  So after being unable to 
solve it the first few times, I thought to pose a question, but 
posing the question never means that I'm done looking at my code and 
hoping I missed something.  I'd much rather have this be my fault... 
that means I have a much higher probability of fixing it.  But I sought 
to explore some tips given to me.  Ah, but the day I could be 100% 
sure, that would be a good day (hell, I'd go ask for a raise for being 
the best coder ever!)

 This stuff is never obvious before the fact - and always seems stupid
 afterward, when you have found it.  Your best (some would say only)
 weapon is your imagination, fueled by scepticism...

   
Yea, seriously!


 try and get a better understanding might be to check data before its 
 stored.  Again, I still don't know how it would get messed up nor can I 
 reproduce the error on my own. 

 Do any of you think that would be a good practice for trying to track 
 this down? (Check the data after reading it, check the data before 
 saving it)
 

 Nothing wrong with doing that to find a bug - not as a general 
 practice, of course - that would be too pessimistic.

 In hard to find bugs - doing anything to narrow the time and place
 of the error down is fair game - the object is to get you to read
 some code that you *know works* with new eyes...

   
I really like that piece of wisdom, I'll add that to my list of coding 
mantras.  Thanks!

 I build in a global boolean variable that I call trace, and when it's on
 I do all sorts of weird stuff, giving a running commentary (either by
 print or in some log-like file) of what the programme is doing, 
 like "read this", "wrote that", "received this", "done that here", etc.
 A bare useful minimum is a "we get here" indicator like the routine
 name, but the data helps a lot too.

   
Yea, I do some of that too.  I use that with conditional print 
statements to stderr when I'm doing my validation against my test 
cases.  But I could definitely do more of them.  The thing will be 
simulating the failure.  In the production server, thousands of printed 
messages would be bad. 

I've done short but heavy simulations, but to no avail.  For example, 
I'll have a couple of systems loop infinitely and beat on the system.  
This is a much heavier load than the system will ever normally face, as 
it's hit a lot at once and then idles for a while.  The test environment 
constantly hits it, and I let that run for several days.  Maybe a longer 
run is needed, but how long is reasonable before determining that it's 
something beyond my control?

 Compared to an assert, it does not stop the execution, and you
 could get lucky by cross correlating such traces from different
 threads. - or better, if you use a queue or a pipe for the log, 
 you might see the timing relationships directly.
   
Ah, store the logs in a rotating queue of fixed size?  That  would work 
pretty well to maintain control on a large run, thanks!

 But this in itself is fraught with danger, as you can hit file size 
 limits, or slow the whole thing down to unusability.

 On the other hand it does not generate the volume that a genuine 
 trace does, it is easier to read, and you can limit it to the bits that
 you are currently suspicious of.

 Programming is such fun...
   
Yea, I'm one of those guys who really gets a sense of satisfaction out 
of coding.  Thanks for the tips.

-carl

-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-25 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Paddy [EMAIL PROTECTED] writes:
| 
| Three to four months before `strange errors`? I'd spend some time
| correlating logs; not just for your program, but for everything running
| on the server. Then I'd expect to cut my losses and arrange to safely
| re-start the program every TWO months.
| (I'd arrange the re-start after collecting logs but before their
| analysis. Life is too short).

Forget it.  That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter).  There are three unrelated
killer facts that interact:

Such failures are usually probabilistic (Poisson process), and
so have no history.

The expected number is usually proportional to the square of the
activity, sometimes a higher power.

Virtually nothing involved does any routine logging, or even has
options to log relevant events.

The first means that the strategy of restarting doesn't help.  All
three mean that current logs are almost never any use.


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-25 Thread Paddy


On Jan 25, 9:26 am, [EMAIL PROTECTED] (Nick Maclaren) wrote:
 In article [EMAIL PROTECTED],Paddy [EMAIL PROTECTED] writes:|
 | Three to four months before `strange errors`? I'd spend some time
 | correlating logs; not just for your program, but for everything running
 | on the server. Then I'd expect to cut my losses and arrange to safely
 | re-start the program every TWO months.
 | (I'd arrange the re-start after collecting logs but before their
 | analysis. Life is too short).

 Forget it.  That strategy is fine in general, but is a waste of time
 where threading issues are involved (or signal handling, or some types
 of communication problem, for that matter).

Nah, it's a great strategy. It keeps you up and running when all you
know for sure is that you will most likely be able to keep things
together for three months normally.
The OP only thinks it's a threading problem - it doesn't matter what the
true fix will be, as long as arranging to re-start the server well
before it's likely to go down doesn't take too long, compared to your
exploration of the problem, and, of course, you have to be able to
afford the glitch in availability.

 There are three unrelated
 killer facts that interact:

 Such failures are usually probabilistic (Poisson process), and
 so have no history.

 The expected number is usually proportional to the square of the
 activity, sometimes a higher power.

 Virtually nothing involved does any routine logging, or even has
 options to log relevant events.

 The first means that the strategy of restarting doesn't help.  All
 three mean that current logs are almost never any use.
 
 Regards,
 Nick Maclaren.



Re: The reliability of python threads

2007-01-25 Thread Carl J. Van Arsdall
Aahz wrote:
 [snip]

 My response is that you're asking the wrong questions here.  Our database
 server locked up hard Sunday morning, and we still have no idea why (the
 machine itself, not just the database app).  I think it's more important
 to focus on whether you have done all that is reasonable to make your
 application reliable -- and then put your efforts into making your app
 recoverable.
   
Well, I assume that I have done all I can to make it reliable.  This 
list is usually my last resort, or a place where I come hoping to find 
ideas that aren't coming to me naturally.  The only other thing I 
thought to come up with was that there might be network errors.  But 
I've gone back and forth on that, because TCP should handle that for me 
and I shouldn't have to deal with it directly in Pyro, although I've 
added (and continue to add) checks in places that appear appropriate 
(and in some cases, checks because I prefer to be paranoid about errors).


 I'm particularly making this comment in the context of your later point
 about the bug showing up only every three or four months.

 Side note: without knowing what error messages you're getting, there's
 not much anybody can say about your programs or the reliability of
 threads for your application.
   
Right, I wasn't coming here to get someone to debug my app, I'm just 
looking for ideas.  I am constantly trying to find new ways to improve 
my software and new ways to reduce bugs, and when I get really stuck, 
new ways to track bugs down.  The exception won't mean much, but I can 
say that the error appears to me as bad data.  I do checks prior to 
performing actions on any data; if the data doesn't look like what it 
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad.  In 
tracking down the problem a couple of guys mentioned that problems like 
that usually are a race condition.  From here I examined my code, 
checked out all the locking stuff, made sure it was good, and wasn't 
able to find anything.  Being that there's one lock and the critical 
sections are well defined, I'm having difficulty.  One idea I have to 
try and get a better understanding might be to check data before it's 
stored.  Again, I still don't know how it would get messed up, nor can I 
reproduce the error on my own. 

Do any of you think that would be a good practice for trying to track 
this down? (Check the data after reading it, check the data before 
saving it)
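
Editor's note: checking on both sides narrows the search, because it separates "bad when produced" from "went bad while stored". A minimal sketch, assuming the shared data lives in something dict-like guarded by the one lock mentioned above. CheckedStore and the validate callback are hypothetical names.

```python
import threading


class CheckedStore:
    """Dict-like store that validates values on the way in and on the way out."""

    def __init__(self, validate):
        self._validate = validate  # callable that raises on bad data
        self._lock = threading.Lock()
        self._data = {}

    def save(self, key, value):
        self._validate(value)  # a failure here blames the producer
        with self._lock:
            self._data[key] = value

    def load(self, key):
        with self._lock:
            value = self._data[key]
        self._validate(value)  # a failure here means it changed while stored
        return value
```

If save() never fails but load() does, the likeliest culprit is a mutable object that some thread still holds a reference to and modifies outside the lock. Storing a copy would be one way to test that theory.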



-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-25 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Paddy [EMAIL PROTECTED] writes:
| 
|  | Three to four months before `strange errors`? I'd spend some time
|  | correlating logs; not just for your program, but for everything running
|  | on the server. Then I'd expect to cut my losses and arrange to safely
|  | re-start the program every TWO months.
|  | (I'd arrange the re-start after collecting logs but before their
|  | analysis. Life is too short).
| 
|  Forget it.  That strategy is fine in general, but is a waste of time
|  where threading issues are involved (or signal handling, or some types
|  of communication problem, for that matter).
| 
| Nah, Its a great strategy. it keeps you up and running when all you
| know for sure is that you will most likely be able to keep things
| together for three months normally.
| 
| The OP only thinks its a threading problem - it doesn't matter what the
| true fix will be, as long as arranging to re-start the server well
^
| before its likely to go down doesn't take too long, compared to your
   
| exploration of the problem, and, of course, you have to be able to
| afford the glitch in availability.

Consider the marked phrase in the context of a Poisson process failure
model, and laugh.  If you don't understand why I say that, I suggest
finding out the properties of the Poisson process!


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-25 Thread Paddy


On Jan 25, 7:36 pm, [EMAIL PROTECTED] (Nick Maclaren) wrote:
 In article [EMAIL PROTECTED],Paddy [EMAIL PROTECTED] writes:|
 |  | Three to four months before `strange errors`? I'd spend some time
 |  | correlating logs; not just for your program, but for everything running
 |  | on the server. Then I'd expect to cut my losses and arrange to safely
 |  | re-start the program every TWO months.
 |  | (I'd arrange the re-start after collecting logs but before their
 |  | analysis. Life is too short).
 | 
 |  Forget it.  That strategy is fine in general, but is a waste of time
 |  where threading issues are involved (or signal handling, or some types
 |  of communication problem, for that matter).
 |
 | Nah, Its a great strategy. it keeps you up and running when all you
 | know for sure is that you will most likely be able to keep things
 | together for three months normally.
 |
 | The OP only thinks its a threading problem - it doesn't matter what the
 | true fix will be, as long as arranging to re-start the server well
 ^
 | before its likely to go down doesn't take too long, compared to your

 | exploration of the problem, and, of course, you have to be able to
 | afford the glitch in availability.

 Consider the marked phrase in the context of a Poisson process failure
 model, and laugh.  If you don't understand why I say that, I suggest
 finding out the properties of the Poisson process!

 Regards,
 Nick Maclaren.
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going. A little learning is fine, but "it can't
theoretically be fixed" is no solution.
With a program that stays up for that long, the situation will usually
work out for the better when either software versions are upgraded, or
OS and drivers are upgraded. (Sometimes as a result of the analysis,
sometimes not).

Keep your eye on the goal and you're more likely to score!

- Paddy.



Re: The reliability of python threads

2007-01-25 Thread Paul Rubin
Paddy [EMAIL PROTECTED] writes:
 No, you should think of the service that needs to be up. You seem to be
 talking about how it can't be fixed rather than looking for ways to
 keep things going.

But you're proposing cargo cult programming.  There is no reason
whatsoever to expect that restarting the server now and then will help
the problem in the slightest.  Nick used the fancy term "Poisson
process", but it just means that the probability of failure at any
moment is independent of what's happened in the past, like the
spontaneous radioactive decay of an atom.  It's not like a mechanical
system where some part gradually gets worn out and eventually breaks,
so you can prevent the failure by replacing the part every so often.
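
Editor's note: the "no memory" property can be checked with a few lines of arithmetic. For a Poisson process the time to the next failure is exponentially distributed, so the chance of failing in the next 60 days is the same whether the server was restarted a minute ago or has been up for 90 days. The 105-day mean (about three and a half months) is an assumption read off the thread's "three to four months", and the function names are illustrative.

```python
import math


def p_fail_within(days, mean_days_between_failures):
    """P(at least one failure in the next `days`) for a Poisson process."""
    return 1.0 - math.exp(-days / mean_days_between_failures)


def p_fail_within_given_uptime(days, uptime, mean_days_between_failures):
    """The same probability, given the server already survived `uptime` days."""
    def survive(t):
        return math.exp(-t / mean_days_between_failures)
    return 1.0 - survive(uptime + days) / survive(uptime)
```

Both p_fail_within(60, 105) and p_fail_within_given_uptime(60, 90, 105) come out at roughly 0.44. This holds only if the failures really are Poisson, which is exactly the point Paddy disputes. A wear-out fault such as a slow leak would behave differently.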

 A little learning is fine but it can't theoretically be fixed is
 no solution.

The best you can do is identify the unfixable situations precisely and
work around them.  Precision is important.

The next best thing is to have several servers running simultaneously,
with failure detection and automatic failover.  

If a server is failing at random every few months, trying to prevent
that by restarting it every so often is just shooting in the dark.
Think of your server stopping now and then because there's a power
failure, where you get power failures every few months on the average.
Shutting down your server once a month, unplugging it, and plugging it
back in will do nothing to prevent those outages.  You need to either
identify and fix whatever is causing the power outages, or install a
backup generator.


Re: The reliability of python threads

2007-01-25 Thread Paddy


On Jan 25, 8:00 pm, Paul Rubin http://[EMAIL PROTECTED] wrote:
 Paddy [EMAIL PROTECTED] writes:
  No, you should think of the service that needs to be up. You seem to be
  talking about how it can't be fixed rather than looking for ways to
  keep things going.
 But you're proposing cargo cult programming.
I don't know that term. What I'm proposing is that if, for example, a
process stops running three times in a year at roughly three- to
four-month intervals, and it should have stayed up, then restart the
server sooner, at a time of your choosing, whilst taking other
measures to investigate the error.
 There is no reason
 whatsoever to expect that restarting the server now and then will help
 the problem in the slightest.
That's where we most likely differ. The problem is only indirectly the
program failing. The customer wants reliable service, which you can get
from unreliable components. It happens all the time in
firmware-controlled systems that periodically reboot themselves as a
matter of course.
 Nick used the fancy term Poisson
 process but it just means that the probability of failure at any
 moment is independent of what's happened in the past, like the
 spontaneous radioactive decay of an atom.  It's not like a mechanical
 system where some part gradually gets worn out and eventually breaks,
 so you can prevent the failure by replacing the part every so often.
Whilst you sit agreeing on how many fairies can dance on the end of a
pin or not, your company could be losing customers. You and Nick seem
to be saying it *must* be Poisson, therefore we can't do...

  A little learning is fine but "it can't theoretically be fixed" is
  no solution.
 The best you can do is identify the unfixable situations precisely and
 work around them.  Precision is important.
I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero defects methodologies.
We had argued ourselves into accepting a certain amount of defective
cars getting out to customers as the result of our theories. The
Japanese practices emphasized *no* defects were acceptable at the
customer, and they seemed to deliver better made cars.

 The next best thing is have several servers running simultaneously,
 with failure detection and automatic failover.
Yah, finally. I can work with that.

 If a server is failing at random every few months, trying to prevent
 that by restarting it every so often is just shooting in the dark.
"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a
fix.
If thinking it happens "at random" leads you to a brick wall, then
switch!

 Think of your server stopping now and then because there's a power
 failure, where you get power failures every few months on the average.
 Shutting down your server once a month, unplugging it, and plugging it
 back in will do nothing to prevent those outages.  You need to either
 identify and fix whatever is causing the power outages, or install a
 backup generator.

Yep. I also know that a mad bloke entering the server room with a
hammer every three to four months is also not likely to be fixed by
restarting the server every two months ;-)

- Paddy.



Re: The reliability of python threads

2007-01-25 Thread Paul Rubin
Paddy [EMAIL PROTECTED] writes:
  But you're proposing cargo cult programming.
 i don't know that term.

http://en.wikipedia.org/wiki/Cargo_cult_programming

 What I'm proposing is that if, for example, a process stops running
 three times in a year at roughly three- to four-month intervals,
 and it should have stayed up, then restart the server sooner, at a
 time of your choosing,

What makes you think that restarting the server will make it less
likely to fail?  It sounds to me like there's zero evidence of that,
since you say "roughly three or four month intervals" and talk about
threading and race conditions.  If it's failing every 3 months, 15
days and 2.43 hours like clockwork, that's different, sure, restart it
every three months.  But the description I see so far sounds like a
random failure caused by some events occurring with low enough
probability that they only happen on average every few months of
operation.  That kind of thing is very common and is often best
diagnosed by instrumenting the hell out of the code.

  There is no reason whatsoever to expect that restarting the server
  now and then will help the problem in the slightest.
 Thats where we most likely differ.

Do you think there is a reason to expect that restarting the server
will help the problem in the slightest?  I realize you seem to expect
that, but you have not given a REASON.  That's what I mean by cargo
cult programming.

 Whilst you sit agreeing on how many fairies can dance on the end of a
 pin or not, your company could be losing customers. You and Nick seem
 to be saying it *must* be Poisson, therefore we can't do...

I dunno about Nick, I'm saying it's best to assume that it's Poisson
and do whatever is necessary to diagnose and fix the bug, and that the
voodoo measure you're proposing is not all that likely to help and it
will take years to find out whether it helps or not (i.e. restarting
after 3 months and going another 3 months without a failure proves
nothing).

 I'm sorry, but your argument reminds me of when Western statistical
 quality control first met with the Japanese Zero defects methodologies.
 We had argued ourselves into accepting a certain amount of defective
 cars getting out to customers as the result of our theories. The
 Japanese practices emphasized *no* defects were acceptable at the
 customer, and they seemed to deliver better made cars.

I don't see your point.  You're the one who wants to keep operating
defective software instead of fixing it.

 "at random" - "every few months"
 Me thinking it happens "every few months" allows me to search for a
 fix.  If thinking it happens "at random" leads you to a brick wall,
 then switch!

But you need evidence before you can say it happens every few months.
Do you have, say, a graph of the exact dates and times of failure, the
number of requests processed so far, etc.?  If it happened at some
exact or almost exact uniform time interval or precisely once every
1.273 million requests or whatever, that tells you something.  But the
earlier description didn't sound like that.  Restarting the server is
not much better than carrying a lucky rabbit's foot.
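
One rough way to look at such a record is the spread of the gaps between failures. The sketch below is only an illustration: `interval_cv` is a made-up helper and both timestamp lists are invented, not real logs. A coefficient of variation near 0 suggests clockwork-regular failures, while a value near 1 is what a Poisson process would tend to give. With only a handful of failures this is weak evidence either way.

```python
from datetime import datetime
from statistics import mean, stdev

def interval_cv(failure_times):
    """Coefficient of variation (stdev / mean) of the gaps between failures.

    Needs at least three timestamps so that there are two gaps to compare.
    """
    ordered = sorted(failure_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return stdev(gaps) / mean(gaps)

# Hypothetical failure logs, not real data.
clockwork = [datetime(2006, m, 1) for m in (1, 4, 7, 10)]
irregular = [datetime(2006, 1, 3), datetime(2006, 1, 20),
             datetime(2006, 6, 11), datetime(2006, 7, 2),
             datetime(2006, 12, 28)]

print("clockwork CV: %.2f" % interval_cv(clockwork))
print("irregular CV: %.2f" % interval_cv(irregular))
```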


Re: The reliability of python threads

2007-01-25 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Paddy [EMAIL PROTECTED] writes:
| 
| No, you should think of the service that needs to be up. You seem to be
| talking about how it can't be fixed rather than looking for ways to
| keep things going. A little learning is fine but "it can't
| theoretically be fixed" is no solution.

I suggest that you do invest in a little learning and look up Poisson
processes.

| Keep your eye on the goal and you're more likely to score!

And, if you have your eye on the wrong goal, you would generally be
better off not scoring :-)


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-25 Thread skip

Paul> I dunno about Nick, I'm saying it's best to assume that it's
Paul> Poisson and do whatever is necessary to diagnose and fix the bug,
Paul> and that the voodoo measure you're proposing is not all that
Paul> likely to help and it will take years to find out whether it helps
Paul> or not (i.e. restarting after 3 months and going another 3 months
Paul> without a failure proves nothing).

What makes you think Paddy indicated he wouldn't try to solve the problem?
Here's what he wrote:

    What I'm proposing is that if, for example, a process stops running
    three times in a year at roughly three- to four-month intervals, and it
    should have stayed up, then restart the server sooner, at a time of
    your choosing, whilst taking other measures to investigate the error.

I see nothing wrong with trying to minimize the chances of a problem rearing
its ugly head while at the same time trying to investigate its cause (and
presumably solve it).

Skip



Re: The reliability of python threads

2007-01-25 Thread Paul Rubin
[EMAIL PROTECTED] writes:
 What makes you think Paddy indicated he wouldn't try to solve the problem?
 Here's what he wrote:
 
 What I'm proposing is that if, for example, a process stops running
 three times in a year at roughly three- to four-month intervals, and it
 should have stayed up, then restart the server sooner, at a time of
 your choosing, whilst taking other measures to investigate the error.

Well, ok, that's better than just rebooting every so often and leaving
it at that, like the firmware systems he cited.

 I see nothing wrong with trying to minimize the chances of a problem

I think a measure to minimize the chance of some problem is only valid
if there's some plausible theory that it WILL decrease the chance of
the problem (e.g. if there's reason to think that the problem is
caused by a very slow resource leak, but that hasn't been suggested).
That's the part that I'm missing from this story.

One thing I'd certainly want to do is set up a test server under a
much heavier load than the real server sees, and check whether the
problem occurs faster.
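
A minimal sketch of such a load test, assuming the suspect code can be driven in-process. The names `stress`, `deposit` and `invariant_holds` are hypothetical stand-ins for the real server's operations and data checks.

```python
import threading

def stress(worker, check, n_threads=8, iterations=20000):
    """Run worker(i) from many threads at once, then return check()."""
    start = threading.Barrier(n_threads)   # release all threads together

    def run():
        start.wait()
        for i in range(iterations):
            worker(i)

    threads = [threading.Thread(target=run) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return check()

# Hypothetical shared state standing in for the server's data.
lock = threading.Lock()
totals = {"deposits": 0, "balance": 0}

def deposit(i):
    with lock:                      # both fields change under one lock hold
        totals["deposits"] += 1
        totals["balance"] += 10

def invariant_holds():
    return totals["balance"] == 10 * totals["deposits"]

ok = stress(deposit, invariant_holds)
print("invariant held under load:", ok)
```

If the invariant breaks under this kind of hammering but never in months of normal traffic, that at least gives a way to reproduce the failure on demand.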


Re: The reliability of python threads

2007-01-25 Thread Hendrik van Rooyen
 Carl J. Van Arsdall [EMAIL PROTECTED] wrote:

 Right, I wasn't coming here to get someone to debug my app, I'm just 
 looking for ideas.  I constantly am trying to find new ways to improve 
 my software and new ways to reduce bugs, and when I get really stuck,
 new ways to track bugs down.  The exception won't mean much, but I can 
 say that the error appears to me as bad data.  I do checks prior to 
 performing actions on any data, if the data doesn't look like what it 
 should look like, then the system flags an exception.
 
 The problem I'm having is determining how the data went bad.  In 
 tracking down the problem a couple guys mentioned that problems like 
 that usually are a race condition.  From here I examined my code, 
 checked out all the locking stuff, made sure it was good, and wasn't 
 able to find anything.  Being that there's one lock and the critical 
 sections are well defined, I'm having difficulty.  One idea I have to 

Are you 100% rock bottom gold plated guaranteed sure that there is
not something else that is also critical that you just haven't realised is?

This stuff is never obvious before the fact - and always seems stupid
afterward, when you have found it.  Your best (some would say only)
weapon is your imagination, fueled by scepticism...

 try and get a better understanding might be to check data before it's
 stored.  Again, I still don't know how it would get messed up nor can I 
 reproduce the error on my own. 
 
 Do any of you think that would be a good practice for trying to track 
 this down? (Check the data after reading it, check the data before 
 saving it)

Nothing wrong with doing that to find a bug - not as a general 
practice, of course - that would be too pessimistic.

In hard to find bugs - doing anything to narrow the time and place
of the error down is fair game - the object is to get you to read
some code that you *know works* with new eyes...
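
A sketch of what checking on both sides of the store could look like. `CheckedStore` and the validity test are hypothetical, invented here for illustration; the real checks would be whatever the application already uses to recognise bad data.

```python
import threading
import time

class CheckedStore:
    """Dict-like store that validates records on the way in and out."""

    def __init__(self, is_valid):
        self._is_valid = is_valid
        self._lock = threading.Lock()
        self._data = {}
        self._last_writer = {}   # key -> (thread name, timestamp)

    def save(self, key, record):
        if not self._is_valid(record):
            raise ValueError("bad record before save: %r=%r" % (key, record))
        with self._lock:
            self._data[key] = record
            self._last_writer[key] = (threading.current_thread().name,
                                      time.time())

    def load(self, key):
        with self._lock:
            record = self._data[key]
            writer = self._last_writer[key]
        if not self._is_valid(record):
            raise ValueError("bad record after load: %r=%r, last written by %r"
                             % (key, record, writer))
        return record

store = CheckedStore(lambda r: isinstance(r, dict) and "id" in r)
store.save("job-1", {"id": 1})
print(store.load("job-1"))
```

If a record passes the check in `save` but fails it in `load`, it went bad while it sat in the store, for instance because some thread kept a reference to a mutable record and changed it later. That narrows the time and place considerably.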

I build in a global boolean variable that I call trace, and when it's on
I do all sorts of weird stuff, giving a running commentary (either by
print or in some log-like file) of what the programme is doing,
like "read this", "wrote that", "received this", "done that here", etc.
A bare useful minimum is a "we get here" indicator like the routine
name, but the data helps a lot too.

Compared to an assert, it does not stop the execution, and you
could get lucky by cross-correlating such traces from different
threads - or better, if you use a queue or a pipe for the log,
you might see the timing relationships directly.

But this in itself is fraught with danger, as you can hit file size 
limits, or slow the whole thing down to unusability.

On the other hand it does not generate the volume that a genuine 
trace does, it is easier to read, and you can limit it to the bits that
you are currently suspicious of.
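
A minimal sketch of the trace idea with a queue as the log. The names `TRACE`, `trace` and `drain_trace` are made up for this example, and the worker is a placeholder for real code.

```python
import queue
import threading
import time

TRACE = True                  # the global switch
_trace_q = queue.Queue()      # one queue shared by all threads

def trace(msg, *args):
    """Record a 'we get here' note without stopping execution."""
    if TRACE:
        text = msg % args if args else msg
        _trace_q.put((time.monotonic(), threading.current_thread().name, text))

def drain_trace():
    """Return everything traced so far, ordered by timestamp."""
    entries = []
    while True:
        try:
            entries.append(_trace_q.get_nowait())
        except queue.Empty:
            break
    return sorted(entries)

def worker(n):
    trace("worker start n=%d", n)
    trace("worker done n=%d", n)

threads = [threading.Thread(target=worker, args=(i,), name="w%d" % i)
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

log = drain_trace()
for stamp, name, msg in log:
    print("%.6f %-4s %s" % (stamp, name, msg))
```

Because every entry carries a timestamp and the thread name, the interleaving of the threads can be read straight off the sorted output. Setting `TRACE = False` turns the whole thing into a cheap no-op.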

Programming is such fun...

hth - Hendrik





The reliability of python threads

2007-01-24 Thread Carl J. Van Arsdall
Hey everyone, I have a question about python threads.  Before anyone 
goes further, this is not a debate about threads vs. processes, just a 
question.

With that, are python threads reliable?  Or rather, are they safe?  I've
had some strange errors in the past.  I use threading.Lock for my
critical sections, but I wonder if that is really good enough.

Does anyone have any conclusive evidence that python threads/locks are 
safe or unsafe?
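
For reference, a minimal sketch of the usage being asked about: a `threading.Lock` guarding a critical section. The counter and the thread counts are invented for the example and are not taken from the application in question.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with counter_lock:        # acquire, run the body, always release
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

The `with` form releases the lock even if the body raises, which rules out one common way of leaving a lock held by accident.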

Thanks,

Carl

-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-24 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Carl J. Van Arsdall [EMAIL PROTECTED] writes:
| Hey everyone, I have a question about python threads.  Before anyone 
| goes further, this is not a debate about threads vs. processes, just a 
| question.
| 
| With that, are python threads reliable?  Or rather, are they safe?  I've
| had some strange errors in the past.  I use threading.Lock for my
| critical sections, but I wonder if that is really good enough.
| 
| Does anyone have any conclusive evidence that python threads/locks are 
| safe or unsafe?

Unsafe.  They are built on top of unsafe primitives (POSIX, Microsoft
etc.)  Python will shield you from some problems, but not all.

There is precious little that you can do, because the root cause is
that the standards and specifications are hopelessly flawed.


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-24 Thread Chris Mellon
On 24 Jan 2007 17:12:19 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:

 In article [EMAIL PROTECTED],
 Carl J. Van Arsdall [EMAIL PROTECTED] writes:
 | Hey everyone, I have a question about python threads.  Before anyone
 | goes further, this is not a debate about threads vs. processes, just a
 | question.
 |
 | With that, are python threads reliable?  Or rather, are they safe?  I've
 | had some strange errors in the past.  I use threading.Lock for my
 | critical sections, but I wonder if that is really good enough.
 |
 | Does anyone have any conclusive evidence that python threads/locks are
 | safe or unsafe?

 Unsafe.  They are built on top of unsafe primitives (POSIX, Microsoft
 etc.)  Python will shield you from some problems, but not all.

 There is precious little that you can do, because the root cause is
 that the standards and specifications are hopelessly flawed.


This is sufficiently inaccurate that I would call it FUD. Using
threads from Python, as from any other language, requires knowledge of
the tradeoffs and limitations of threading, but claiming that their
usage is *inherently* unsafe isn't true. It is almost certain that
your code and locking are flawed, not that the threads underneath you
are buggy.
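
As an illustration of how locking can be flawed even when every access to shared data is locked, here is a sketch of a check-then-act race. The `stock` dictionary and both functions are hypothetical. The barrier exists only to force the unlucky interleaving on every run so that the effect is repeatable.

```python
import threading

lock = threading.Lock()
stock = {"widgets": 1}            # hypothetical shared state
both_checked = threading.Barrier(2)

def racy_take():
    with lock:                    # locked read ...
        available = stock["widgets"] > 0
    both_checked.wait()           # force the unlucky interleaving every time
    if available:
        with lock:                # ... and locked write, but not as one unit
            stock["widgets"] -= 1

def safe_take():
    with lock:                    # check and update under one lock hold
        if stock["widgets"] > 0:
            stock["widgets"] -= 1

def run_two(target):
    threads = [threading.Thread(target=target) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

run_two(racy_take)
racy_result = stock["widgets"]

stock["widgets"] = 1
run_two(safe_take)
safe_result = stock["widgets"]

print("racy:", racy_result, "safe:", safe_result)
```

In the racy version both threads see one widget and both take it, leaving the count at -1. Without the barrier the same thing could still happen, just rarely, which is the kind of once-in-months failure described in this thread.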


Re: The reliability of python threads

2007-01-24 Thread skip

Carl> Does anyone have any conclusive evidence that python threads/locks
Carl> are safe or unsafe?

In my experience Python threads are generally safer than the programmers
that use them. ;-)

Skip


Re: The reliability of python threads

2007-01-24 Thread Carl J. Van Arsdall
[EMAIL PROTECTED] wrote:
 Carl> Does anyone have any conclusive evidence that python threads/locks
 Carl> are safe or unsafe?

 In my experience Python threads are generally safer than the programmers
 that use them. ;-)
   
Haha, yea, tell me about it.  The whole GIL thing made me nervous about
the locking operations happening truly atomically and not getting
weird.  Thanks for assuring me that I'm just nuts :)

-carl

-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-24 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Chris Mellon [EMAIL PROTECTED] writes:
|  |
|  | Does anyone have any conclusive evidence that python threads/locks are
|  | safe or unsafe?
| 
|  Unsafe.  They are built on top of unsafe primitives (POSIX, Microsoft
|  etc.)  Python will shield you from some problems, but not all.
| 
|  There is precious little that you can do, because the root cause is
|  that the standards and specifications are hopelessly flawed.
| 
| This is sufficiently inaccurate that I would call it FUD. Using
| threads from Python, as from any other language, requires knowledge of
| the tradeoffs and limitations of threading, but claiming that their
| usage is *inherently* unsafe isn't true. It is almost certain that
| your code and locking are flawed, not that the threads underneath you
| are buggy.

I suggest that you find out rather more about the ill-definition of
the POSIX threading memory model, to name one of the better documented
aspects.  A Web search should provide you with more information on
the ghastly mess than any sane person wants to know.

And that is only one of many aspects :-(


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-24 Thread Chris Mellon
On 24 Jan 2007 18:21:38 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:

 In article [EMAIL PROTECTED],
 Chris Mellon [EMAIL PROTECTED] writes:
 |  |
 |  | Does anyone have any conclusive evidence that python threads/locks are
 |  | safe or unsafe?
 | 
 |  Unsafe.  They are built on top of unsafe primitives (POSIX, Microsoft
 |  etc.)  Python will shield you from some problems, but not all.
 | 
 |  There is precious little that you can do, because the root cause is
 |  that the standards and specifications are hopelessly flawed.
 |
 | This is sufficiently inaccurate that I would call it FUD. Using
 | threads from Python, as from any other language, requires knowledge of
 | the tradeoffs and limitations of threading, but claiming that their
 | usage is *inherently* unsafe isn't true. It is almost certain that
 | your code and locking are flawed, not that the threads underneath you
 | are buggy.

 I suggest that you find out rather more about the ill-definition of
 the POSIX threading memory model, to name one of the better documented
 aspects.  A Web search should provide you with more information on
 the ghastly mess than any sane person wants to know.

 And that is only one of many aspects :-(


I'm aware of the issues with the POSIX threading model. I still stand
by my statement - bringing up the problems with the provability of
correctness in the POSIX model amounts to FUD in a discussion of
actual problems with actual code.

Logic and programming errors in user code are far more likely to be
the cause of random errors in a threaded program than theoretical
(I've never come across a case in practice) issues with the POSIX
standard.

Emphasizing this means that people will tend to ignore bugs as being
the fault of POSIX rather than either auditing their code more
carefully, or avoiding threads entirely (the second being what I
suspect your goal is).

As a last case, I should point out that while the POSIX memory model
can't be proven safe, concrete implementations do not necessarily
suffer from this problem.


Re: The reliability of python threads

2007-01-24 Thread Carl J. Van Arsdall
Chris Mellon wrote:
 On 24 Jan 2007 18:21:38 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:
   
 [snip]

 

 I'm aware of the issues with the POSIX threading model. I still stand
 by my statement - bringing up the problems with the provability of
 correctness in the POSIX model amounts to FUD in a discussion of
 actual problems with actual code.

 Logic and programming errors in user code are far more likely to be
 the cause of random errors in a threaded program than theoretical
 (I've never come across a case in practice) issues with the POSIX
 standard.
   
Yea, typically I would think that.  The problem I am seeing is
incredibly intermittent.  Like a simple Pyro server that gives me a
problem maybe every three or four months.  Just something funky will
happen to the state of the whole thing, some bad data.  I'm having an
issue tracking it down, and some more experienced programmers mentioned
that it's most likely a race condition.  The thing is, I'm really not
doing anything too crazy, so I'm having difficulty tracking it down.  I
had heard in the past that there may be issues with threads, so I
thought to investigate this side of things.

It still proves difficult, but reassurance of the threading model helps 
me focus my efforts.

 Emphasizing this means that people will tend to ignore bugs as being
 the fault of POSIX rather than either auditing their code more
 carefully, or avoiding threads entirely (the second being what I
 suspect your goal is).

 As a last case, I should point out that while the POSIX memory model
 can't be proven safe, concrete implementations do not necessarily
 suffer from this problem.
   
Would you consider the Linux implementation of threads to be concrete?

-carl

-- 

Carl J. Van Arsdall
[EMAIL PROTECTED]
Build and Release
MontaVista Software



Re: The reliability of python threads

2007-01-24 Thread Nick Maclaren

In article [EMAIL PROTECTED],
Carl J. Van Arsdall [EMAIL PROTECTED] writes:
| Chris Mellon wrote:
| 
|  Logic and programming errors in user code are far more likely to be
|  the cause of random errors in a threaded program than theoretical
|  (I've never come across a case in practice) issues with the POSIX
|  standard.
|
| Yea, typically I would think that.  The problem I am seeing is
| incredibly intermittent.  Like a simple Pyro server that gives me a
| problem maybe every three or four months.  Just something funky will
| happen to the state of the whole thing, some bad data.  I'm having an
| issue tracking it down, and some more experienced programmers mentioned
| that it's most likely a race condition.  The thing is, I'm really not
| doing anything too crazy, so I'm having difficulty tracking it down.  I
| had heard in the past that there may be issues with threads, so I
| thought to investigate this side of things.

I have seen that many dozens of times on half a dozen Unices, but have
only tracked down the cause in a handful of cases.  Of those,
implementation defects that are sanctioned by the standards have
accounted for about half.

Note that the term "race condition" is accurate but misleading!  One
of the worst problems with POSIX is that it does not define how
non-memory global state is synchronised.  For example, it is possible
for a memory update and an associated signal to occur on different
sides of a synchronisation boundary.  Similarly, it is possible for
I/O to sidestep POSIX's synchronisation boundaries.  I have seen both.

Perhaps the nastiest is that POSIX leaves it unclear whether the
action of synchronisation is transitive.  So, if A synchronises with
B, and then B with C, A may not have synchronised with C.  Again, I
have seen that.  It can happen on Intel systems, according to the
experts I know.

| Would you consider the Linux implementation of threads to be concrete?

In this sort of area, Linux tends to be saner than most systems, but
remember that it has had MUCH less stress testing on threaded codes
than many other Unices.  In fact, it was only a few years ago that
Linux threads became stable enough to be worth using.

Note that failures due to implementation defects and flaws in the
standards are likely to show up in very obscure ways; ones due to
programmer error tend to be much simpler.

If you want to contact me by Email, and can describe technically
what you are doing and (most importantly) what you are assuming, I
may be able to give some hints.  But no promises.


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-24 Thread Aahz
In article [EMAIL PROTECTED],
Carl J. Van Arsdall [EMAIL PROTECTED] wrote:

Hey everyone, I have a question about python threads.  Before anyone 
goes further, this is not a debate about threads vs. processes, just a 
question.

With that, are python threads reliable?  Or rather, are they safe?  I've
had some strange errors in the past.  I use threading.Lock for my
critical sections, but I wonder if that is really good enough.

Does anyone have any conclusive evidence that python threads/locks are 
safe or unsafe?

My response is that you're asking the wrong questions here.  Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app).  I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.

I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
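
One possible shape for the "recoverable" part is sketched below, assuming the service can be wrapped in a single callable. `run_with_recovery` and `flaky_service` are hypothetical names, and the simulated failure only stands in for whatever actually goes wrong.

```python
import logging
import time

def run_with_recovery(serve, max_restarts=5, backoff_seconds=1.0,
                      sleep=time.sleep):
    """Call serve() and restart it after a crash, logging each failure.

    Returns the number of restarts used once serve() returns normally and
    re-raises the last error when max_restarts is exhausted.
    """
    restarts = 0
    while True:
        try:
            serve()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
            # Keep the traceback: it is the evidence for later diagnosis.
            logging.exception("service crashed; restart %d of %d",
                              restarts, max_restarts)
            sleep(backoff_seconds * restarts)   # simple linear backoff

attempts = []

def flaky_service():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated intermittent failure")

used = run_with_recovery(flaky_service, sleep=lambda seconds: None)
print("restarts needed:", used)
```

The restart keeps the service up, and the logged traceback keeps the information needed to work on the cause, so recovery and diagnosis do not have to compete.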
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

Help a hearing-impaired person: http://rule6.info/hearing.html


Re: The reliability of python threads

2007-01-24 Thread Nick Maclaren

In article [EMAIL PROTECTED],
[EMAIL PROTECTED] (Aahz) writes:
| 
| My response is that you're asking the wrong questions here.  Our database
| server locked up hard Sunday morning, and we still have no idea why (the
| machine itself, not just the database app).  I think it's more important
| to focus on whether you have done all that is reasonable to make your
| application reliable -- and then put your efforts into making your app
| recoverable.

Absolutely!  Shit happens.  In a well-designed world, that would not be
the case, but we don't live in one.  Until you have identified the cause,
you can't tell if threading has anything to do with the failure - given
what we know, it seems likely, but what Aahz says is how to tackle the
problem WHATEVER the cause.


Regards,
Nick Maclaren.


Re: The reliability of python threads

2007-01-24 Thread Paddy


On Jan 24, 6:43 pm, Carl J. Van Arsdall [EMAIL PROTECTED]
wrote:
 Chris Mellon wrote:
  On 24 Jan 2007 18:21:38 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:

  [snip]

  I'm aware of the issues with the POSIX threading model. I still stand
  by my statement - bringing up the problems with the provability of
  correctness in the POSIX model amounts to FUD in a discussion of
  actual problems with actual code.

  Logic and programming errors in user code are far more likely to be
  the cause of random errors in a threaded program than theoretical
  (I've never come across a case in practice) issues with the POSIX
  standard.
 Yea, typically I would think that.  The problem I am seeing is
 incredibly intermittent.  Like a simple Pyro server that gives me a
 problem maybe every three or four months.  Just something funky will
 happen to the state of the whole thing, some bad data.  I'm having an
 issue tracking it down, and some more experienced programmers mentioned
 that it's most likely a race condition.  The thing is, I'm really not
 doing anything too crazy, so I'm having difficulty tracking it down.  I
 had heard in the past that there may be issues with threads, so I
 thought to investigate this side of things.

 It still proves difficult, but reassurance of the threading model helps
 me focus my efforts.

SNIP
 -carl

Three to four months before `strange errors`? I'd spend some time
correlating logs; not just for your program, but for everything running
on the server. Then I'd expect to cut my losses and arrange to safely
restart the program every TWO months.
(I'd arrange the restart after collecting logs but before their
analysis. Life is too short.)

- Paddy.



Re: The reliability of python threads

2007-01-24 Thread John Nagle
Carl J. Van Arsdall wrote:
 Chris Mellon wrote:
 
 On 24 Jan 2007 18:21:38 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:
  

 [snip]

 


 I'm aware of the issues with the POSIX threading model. I still stand
 by my statement - bringing up the problems with the provability of
 correctness in the POSIX model amounts to FUD in a discussion of
 actual problems with actual code.

 Logic and programming errors in user code are far more likely to be
 the cause of random errors in a threaded program than theoretical
 (I've never come across a case in practice) issues with the POSIX
 standard.
   
 
 Yea, typically I would think that.  The problem I am seeing is
 incredibly intermittent.  Like a simple Pyro server that gives me a
 problem maybe every three or four months.  Just something funky will
 happen to the state of the whole thing, some bad data.  I'm having an
 issue tracking it down, and some more experienced programmers mentioned
 that it's most likely a race condition.

 Right.  You're at MontaVista, which does real-time Linux systems
and support.  There will be people there who thoroughly understand
thread issues.  (I've used QNX for real time, but MontaVista has
made progress in recent years.)

 The Python thread documentation is kind of vague about how
well the Python primitives are protected against concurrency problems.
For example, do you have to protect basic types like lists
and hashes against concurrent access?  Is pop atomic?
(It is in deque, but what about regular lists?)
Can you crash Python from within Python via concurrency errors?
Does the garbage collector run concurrently or does it freeze
all threads?  What's different depending upon whether you're using
real OS threads or simulated Python threads?
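
On the question of basic types: the Python FAQ entry "What kinds of global value mutation are thread-safe?" lists single operations on built-in types, such as `L.append(x)` and `x = L.pop()`, as atomic in CPython, while compound statements such as `L[i] = L[i] + 1` are not. That describes CPython rather than a language guarantee. A small sketch, with a made-up producer function:

```python
import threading

shared = []                       # plain list, deliberately no lock

def producer(tag, n):
    for i in range(n):
        shared.append((tag, i))   # a single list operation

threads = [threading.Thread(target=producer, args=(t, 5000)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared), len(set(shared)))
```

No appended item is lost or duplicated here, but anything that reads a value and then writes it back still needs a lock. For handing data between threads, `queue.Queue` is the more portable choice.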

John Nagle


Re: The reliability of python threads

2007-01-24 Thread Klaas


On Jan 24, 10:43 am, Carl J. Van Arsdall [EMAIL PROTECTED]
wrote:
 Chris Mellon wrote:
  On 24 Jan 2007 18:21:38 GMT, Nick Maclaren [EMAIL PROTECTED] wrote:

  [snip]

  I'm aware of the issues with the POSIX threading model. I still stand
  by my statement - bringing up the problems with the provability of
  correctness in the POSIX model amounts to FUD in a discussion of
  actual problems with actual code.

  Logic and programming errors in user code are far more likely to be
  the cause of random errors in a threaded program than theoretical
  (I've never come across a case in practice) issues with the POSIX
  standard.Yea, typically I would think that.  The problem I am seeing is
 incredibly intermittent.  Like a simple pyro server that gives me a
 problem maybe every three or four months.  Just something funky will
 happen to the state of the whole thing, some bad data, i'm having an
 issue tracking it down and some more experienced programmers mentioned
 that its most likely a race condition.  THe thing is, I'm really not
 doing anything too crazy, so i'm having difficult tracking it down.  I
 had heard in the past that there may be issues with threads, so I
 thought to investigate this side of things.

 It still proves difficult, but reassurance of the threading model helps
 me focus my efforts.

  Emphasizing this means that people will tend to ignore bugs as being
  the fault of POSIX rather than either auditing their code more
  carefully, or avoiding threads entirely (the second being what I
  suspect your goal is).

  As a last case, I should point out that while the POSIX memory model
  can't be proven safe, concrete implementations do not necessarily
  suffer from this problem.

 Would you consider the Linux implementation of threads to be concrete?

 -carl

 --

 Carl J. Van Arsdall
 [EMAIL PROTECTED]
 Build and Release
 MontaVista Software



Re: The reliability of python threads

2007-01-24 Thread Klaas
On Jan 24, 10:43 am, Carl J. Van Arsdall [EMAIL PROTECTED]
wrote:

 Yea, typically I would think that.  The problem I am seeing is
 incredibly intermittent.  Like a simple pyro server that gives me a
 problem maybe every three or four months.  Just something funky will
 happen to the state of the whole thing, some bad data, I'm having an
 issue tracking it down and some more experienced programmers mentioned
 that it's most likely a race condition.  The thing is, I'm really not
 doing anything too crazy, so I'm having difficulty tracking it down.  I
 had heard in the past that there may be issues with threads, so I
 thought to investigate this side of things.

POSIX issues aside, Python's threading model should be less susceptible
to memory-barrier problems that are possible in other languages (this
is due to the GIL).  Double-checked locking, frinstance, is safe in
python even though it isn't in java.
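
For anyone who has not met the term, double-checked locking is lazy
initialisation that tests a shared reference once without the lock and
once more with it held.  A minimal sketch, assuming CPython (where the
global is only rebound after the object is fully built); make_resource
and get_resource are hypothetical names:

```python
import threading

_lock = threading.Lock()
_instance = None
created = []  # records each construction, for checking only

def make_resource():
    # Stand-in for an expensive one-time initialisation (hypothetical).
    created.append(1)
    return object()

def get_resource():
    global _instance
    if _instance is None:          # first check: no lock taken
        with _lock:
            if _instance is None:  # second check: lock held
                _instance = make_resource()
    return _instance

seen = [None] * 8

def worker(i):
    # Each thread writes only its own slot.
    seen[i] = get_resource()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The unlocked first check is the part that can go wrong under a weaker
memory model.  The locked second check is what guarantees a single
construction.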

Are you ever relying solely on the GIL to access shared data?

-Mike



Re: The reliability of python threads

2007-01-24 Thread Paul Rubin
Klaas [EMAIL PROTECTED] writes:
 POSIX issues aside, Python's threading model should be less susceptible
 to memory-barrier problems that are possible in other languages (this
 is due to the GIL). 

But the GIL is not part of Python's threading model; it's just a
particular implementation artifact.  Programs that rely on it are
asking for trouble.

 Double-checked locking, frinstance, is safe in python even though it
 isn't in java.

What's that?

 Are you ever relying solely on the GIL to access shared data?

I think a lot of programs do that, which is probably unwise in the
long run.
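
As an illustration of the alternative, here is a minimal sketch of a
shared counter guarded by an explicit threading.Lock rather than by the
GIL.  The Counter class and hammer function are invented for the
example.  An in-place += on an attribute is a read, an add and a write,
so it is not a single step even on CPython.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write a single critical section.
        with self._lock:
            self.value += 1

def hammer(counter, n):
    for _ in range(n):
        counter.increment()

counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 10000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place the final count does not depend on how the
interpreter schedules the threads.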


Re: The reliability of python threads

2007-01-24 Thread Klaas
On Jan 24, 4:11 pm, Paul Rubin [EMAIL PROTECTED] wrote:
 Klaas [EMAIL PROTECTED] writes:
  POSIX issues aside, Python's threading model should be less susceptible
  to memory-barrier problems that are possible in other languages (this
  is due to the GIL).

 But the GIL is not part of Python's threading model; it's just a
 particular implementation artifact.  Programs that rely on it are
 asking for trouble.

CPython is more than a particular implementation of python, and the
GIL is more than an artifact.  It is a central tenet of threaded
python programming.

I don't advocate relying on the GIL to manage shared data when
threading, but 1) it is useful for the reasons I mention 2) the OP's
question was almost certainly about an application written for and run
on CPython.

  Double-checked locking, frinstance, is safe in python even though it
  isn't in java.

 What's that?

google.com

-Mike



Re: The reliability of python threads

2007-01-24 Thread Paul Rubin
Klaas [EMAIL PROTECTED] writes:
 CPython is more than a particular implementation of python,

It's precisely a particular implementation of Python.  Other
implementations include Jython, PyPy, and IronPython.

  and the GIL is more than an artifact.  It is a central tenet of
 threaded python programming.

If it's a central tenet of threaded python programming, why is it not
mentioned at all in the language or library manual?  The threading
module documentation describes the right way to handle thread
synchronization in Python, and that module implements traditional
locking approaches without reference to the GIL.
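
For concreteness, a small sketch of that traditional style using only
primitives from the threading module: a condition variable protecting a
queue shared by a producer and a consumer.  The Channel class and the
None sentinel are assumptions made for the example, not anything the
library prescribes.

```python
import threading
from collections import deque

class Channel:
    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()      # wake one waiting consumer

    def get(self):
        with self._cond:
            while not self._items:   # re-check after every wakeup
                self._cond.wait()
            return self._items.popleft()

received = []

def consumer(ch):
    while True:
        item = ch.get()
        if item is None:             # sentinel: producer is done
            return
        received.append(item)

ch = Channel()
t = threading.Thread(target=consumer, args=(ch,))
t.start()
for i in range(5):
    ch.put(i)
ch.put(None)
t.join()
```

Nothing here depends on how the interpreter schedules threads.  The
standard library's Queue class packages the same pattern.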

 I don't advocate relying on the GIL to manage shared data when
 threading, but 1) it is useful for the reasons I mention 2) the OP's
 question was almost certainly about an application written for and run
 on CPython.

Possibly true.


Re: The reliability of python threads

2007-01-24 Thread Damjan
  and the GIL is more than an artifact.  It is a central tenet of
 threaded python programming.
 
 If it's a central tenet of threaded python programming, why is it not
 mentioned at all in the language or library manual?  The threading
 module documentation describes the right way to handle thread
 synchronization in Python, and that module implements traditional
 locking approaches without reference to the GIL.

And we all hope the GIL will one day die its natural death ... 
maybe... probably.. hopefully ;)


-- 
damjan


Re: The reliability of python threads

2007-01-24 Thread Klaas
On Jan 24, 5:18 pm, Paul Rubin [EMAIL PROTECTED] wrote:
 Klaas [EMAIL PROTECTED] writes:
  CPython is more than a particular implementation of python,

 It's precisely a particular implementation of Python.  Other
 implementations include Jython, PyPy, and IronPython.

I did not deny that it is an implementation of Python.  I deny that it
is but an implementation of Python.

Jython: several versions behind, used primarily for interfacing with
Java
PyPy: years away from being a practical platform for replacing CPython
IronPython: best example you've given, but still probably three or four
orders of magnitude less significant than CPython

   and the GIL is more than an artifact.  It is a central tenet of
  threaded python programming.

 If it's a central tenet of threaded python programming, why is it not
 mentioned at all in the language or library manual?

The same reason why IE CSS quirks are not delineated in the HTML 4.01
spec.  This doesn't mean that they aren't central to CSS web
programming (they are).

How could the GIL, which limits the number of threads that can run
Python code at any one time in a single process to one, NOT be a
central part of threaded python programming?

 The threading
 module documentation describes the right way to handle thread
 synchronization in Python, and that module implements traditional
 locking approaches without reference to the GIL.

No-one has argued that the GIL should be used instead of
threading-based locking.  How could they? The two concepts are not
interchangeable and while they affect each other, are two different
things entirely.  In the post you responded to and quoted I said:

  I don't advocate relying on the GIL to manage shared data when
  threading, 

-Mike
