Apparently when Google marks a bug as Obsolete, it becomes secret. I don't 
know why. So here we are.

Reported by pdknsk, Mar 19, 2012
I have two backends which run the exact same code (except for locale 
differences), named UK and US. I haven't updated the code in more than 30 
days. Since a few days ago, backend UK has been having major problems.

Usually both backends run simultaneously about every hour, for about 10 
minutes. Backend UK, however, keeps running for hours (until I stop/start 
it) but doesn't do anything!

This has implications for backend US as well: even though it shuts down 
properly, both backends' tasks run on the same task queue, and once backend 
UK has been running for several hours, the queue stalls and doesn't run any 
enqueued tasks on backend US.
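
(Purely for illustration, one common way this is wired up; the queue name 
and the use of target= are my assumption, not taken from the actual code. 
Both backends get their work from the same push queue, so a handler that 
never finishes on one backend holds up tasks destined for the other.)

from google.appengine.api import taskqueue

# hypothetical: one shared queue feeding both backends
taskqueue.add(queue_name='shared', url='/amzn/', target='uk')
taskqueue.add(queue_name='shared', url='/amzn/', target='us')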

At the end of the single task running on the backend is a major db 
transaction with several puts and a few transactional tasks. The only code 
running after it is a single logging.info line. The transaction seems to 
complete, albeit only partially. The tasks are enqueued (and run), and some 
of the puts also complete, but at least one put does not complete 
successfully. I know this because one of the tasks reads back two data sets 
(over)written in the transaction (by two different puts) and does some 
computation on them. The data sets don't "match" though (causing an 
exception), which means one of them was not overwritten in the same 
transaction.

Therefore my theory is that the transaction "hangs", which causes the task 
not to complete and the backend not to shut down.
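
To make the structure concrete, this is roughly the shape of the end of the 
handler as a sketch (the entity and function names, and the task URL, are 
mine; the actual code isn't shown here):

import logging

from google.appengine.api import taskqueue
from google.appengine.ext import db

def txn(uk_data, us_data):
    db.put(uk_data)        # several puts, writing the two data sets...
    db.put(us_data)
    # ...and a few transactional tasks enqueued in the same transaction
    taskqueue.add(url='/work', transactional=True)

def handler_tail(uk_data, us_data):
    db.run_in_transaction(txn, uk_data, us_data)
    logging.info('transaction done')   # the only code after the transaction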

When I stop/start the backend the problem seems to resolve itself for at 
least a few hours (task completes successfully, and backend shuts down), 
but it occurs again eventually.

Why does this only affect backend UK and not backend US, even though it's 
the same code? I don't know. I've noticed the following in the backend 
logs, which might help to diagnose the problem. The problem started after 
this occurred, though it might be a coincidence.

2012-03-16 19:09:58.305 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:57.229 /_ah/stop 500 1ms 0kb
2012-03-16 19:09:56.110 /_ah/stop 500 9ms 0kb
2012-03-16 19:09:55.033 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:53.958 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:52.891 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:51.820 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:50.734 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:49.636 /_ah/stop 500 5ms 0kb
2012-03-16 19:09:48.556 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:47.504 /_ah/stop 500 13ms 0kb
2012-03-16 19:09:46.416 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:45.531 /_ah/stop 500 188ms 0kb
2012-03-16 19:09:44.280 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:43.202 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:42.137 /_ah/stop 500 4ms 0kb
2012-03-16 19:09:41.055 /_ah/stop 500 3ms 0kb
2012-03-16 19:09:38.594 /_ah/start 404 198ms 0kb instance=0

Mar 29, 2012 #1 pdknsk
This is still occurring, but less frequently. I've made some changes to the 
transaction (grouping several puts into just two puts, and moving db reads 
out of the transaction), which might have helped.
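
Roughly what the reworked code now does, as a sketch (the names are mine, 
not from the actual code): the db reads happen before the transaction, and 
the writes are grouped into two batch puts.

from google.appengine.ext import db

def run_update(keys, first_batch, second_batch):
    existing = db.get(keys)      # db reads moved out of the transaction

    def txn():
        db.put(first_batch)      # several puts grouped into just two
        db.put(second_batch)

    db.run_in_transaction(txn)
    return existing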

I've added lots of logging and debugging, and this is what happens, 
illustrated with slightly simplified code.

...
logging.info(4)                    # last entry that actually shows up in the log
simple_function(1)
logging.info(5)                    # never logged
simple_function(2)
logging.info(6)                    # never logged
db.run_in_transaction(function)    # its puts and tasks do happen
logging.info(7)                    # never logged

The transaction completes, which I can verify because of the tasks it 
starts and the db entries it writes, but logging stops at '4'. And the 
backend doesn't stop but keeps "running" for hours.

This shouldn't be happening but it is!

Mar 29, 2012 #2 pdknsk
Just to add.

1.) This is not merely a logging problem. The instance is running, as 
verified in the instances tab. Memory usage remains unchanged, and CPU 
stays at 0%.

2.) Inconsistent puts inside the transaction are probably unrelated and 
instead an entirely different problem, as explained in issue 7219. I 
haven't noticed it since grouping the puts anyway, so this part of the bug 
report can most likely be ignored.

Apr 15, 2012 #3 nickjohn...@google.com
Status: Acknowledged 
Labels: -Type-Defect Type-Production Component-Backends 

Apr 16, 2012 #4 pdknsk
The "probably because db transaction hangs" part of the topic can be 
removed, because this is most likely not the cause and when it occurred it 
was probably a fluke or a different problem.

I've done further code changes and haven't noticed the problem since. I'm 
quite sure it's logservice related. I had imported logservice, which, due 
to the still unfixed issue 6528, probably only very few people actually use 
correctly, which would explain why this bug hasn't been reported before. 
I've removed logservice and also moved more code outside the transaction 
(including logging), so that it's basically just a single put of a list and 
an async memcache call.
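
As a sketch, the transaction is now reduced to roughly this (the names are 
mine; logservice is no longer imported, and logging happens after the 
commit):

import logging

from google.appengine.api import memcache
from google.appengine.ext import db

def run_update(entities, cache_values):
    client = memcache.Client()

    def txn():
        db.put(entities)                             # single put of a list
        return client.set_multi_async(cache_values)  # async memcache call

    rpc = db.run_in_transaction(txn)
    rpc.get_result()              # resolve the memcache RPC after the commit
    logging.info('committed')     # logging moved outside the transaction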

So IMO this bug can be closed. I'm sure there is still a bug in there 
somewhere, but it's very difficult to debug and find.

Apr 16, 2012 #5 pdknsk
Just for reference, some logs.

2012-03-17 20:13:18.909 /_ah/stop 500 10ms 0kb
2012-03-17 20:13:18.905 /amzn/ 500 59749427ms 0kb instance=0
2012-03-17 03:37:29.731 /_ah/start 404 178ms 0kb instance=0

2012-03-19 15:09:21.729 /amzn/ 500 15583876ms 0kb instance=0
2012-03-19 15:09:21.729 /_ah/stop 500 133ms 0kb
2012-03-19 10:49:38.303 /_ah/start 404 297ms 0kb instance=0

And after I had already made some code changes and added debugging.

2012-03-26 16:29:10.020 /_ah/stop 500 116ms 0kb
2012-03-26 16:29:09.933 /amzn/ 500 10148299ms 0kb instance=0
2012-03-26 16:29:08.981 /_ah/stop 500 30383ms 0kb
2012-03-26 13:40:02.011 /_ah/start 404 366ms 0kb instance=0

Apr 22, 2012 #6 pdknsk
Well, the problem has returned :(

2012-04-22 14:13:59.222 /_ah/stop 500 46ms 0kb
2012-04-22 14:13:59.185 /amzn/ 500 17333262ms 0kb instance=0
2012-04-22 14:13:58.297 /_ah/stop 500 30240ms 0kb
2012-04-22 09:25:06.297 /_ah/start 404 289ms 0kb instance=0

Apr 22, 2012 #7 pdknsk
Just want to add: in the last case, it definitely reached the end of the 
handler, because the last line is a logging.info() call and it was logged. 
So it seems as if the backend just doesn't get the /_ah/stop signal.
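
For reference, the documented way for backend code to notice the stop 
signal is the runtime shutdown hook; whether it ever fires here is exactly 
what's in question, but as a sketch it looks like this (the hook body is 
mine):

import logging

from google.appengine.api import runtime

def shutdown_hook():
    # called when the backend instance receives /_ah/stop
    logging.info('shutdown hook called')

runtime.set_shutdown_hook(shutdown_hook)

# long-running work can also poll runtime.is_shutting_down() to bail out early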

Apr 22, 2012 #8 pdknsk
It seems to be occurring more frequently again. The INFO is literally the 
last line in the handler, so there is no reason the backend should still be 
"active" and not complete.

2012-04-22 18:08:25.095 /amzn/ active start=2012-04-22,11:02:55.767 
instance=0
I 2012-04-22 18:08:25.094 2831 3555 2568

When I stop the backend, the log obviously changes.

2012-04-22 18:14:27.759 /_ah/stop 500 51ms 0kb
2012-04-22 18:14:27.734 /amzn/ 500 691966ms 0kb instance=0
I 2012-04-22 18:08:25.094 2831 3555 2568
E 2012-04-22 18:14:27.726 Process terminated because the backend took too 
long to shutdown.
2012-04-22 18:14:26.785 /_ah/stop 500 30139ms 0kb

May 8, 2012 #9 i...@google.com
(No comment was entered for this change.)
Status: Obsolete 


