Re: [google-appengine] Re: Re-entrancy / deadlock bug with scheduler?
I just made a change to your app's configuration. Could you retry your repro? Did this help? Alternatively, is there anything I could do to trigger the repro?
[google-appengine] Re: Re-entrancy / deadlock bug with scheduler?
Still happening right now. I can repro it pretty easily: schedule ~10 image processing tasks in a queue, and 1 to 3 often fail due to that re-entrancy bug. They then retry automatically and usually go through after a few attempts.

5 instances idle / running, no load on the app: only 1 user currently. It really looks as if the scheduler, when it gets the callback from the EC2 box, tries to serve it from the same instance the urlfetch() is running on: there must be a bug where the scheduler thinks the instance is not busy or something. If you look at the distribution of requests, something's clearly wrong, as 1 instance gets the majority of the traffic:

QPS*    Latency*   Requests   Errors   Age       Memory        Availability
0.000   0.0 ms     7          0        0:03:30   17.7 MBytes   Dynamic
0.000   0.0 ms     152        1        0:26:56   68.6 MBytes   Dynamic
0.000   0.0 ms     6          0        0:32:22   23.9 MBytes   Dynamic
0.000   0.0 ms     4          0        0:32:12   24.6 MBytes   Dynamic
0.000   0.0 ms     3          0        0:32:09   14.5 MBytes   Dynamic
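For reference, the repro above just enqueues the imports through the task queue API. A minimal sketch of what that looks like, assuming the /task/import_photo handler from the logs below and a hypothetical image_url parameter:

from google.appengine.api import taskqueue

def enqueue_image_imports(image_urls):
    # Enqueue one /task/import_photo task per image; with ~10 of these,
    # 1 to 3 usually hit the 10 second urlfetch() deadline described below.
    for image_url in image_urls:
        taskqueue.add(url='/task/import_photo',
                      params={'image_url': image_url})  # parameter name is assumed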
Re: [google-appengine] Re: Re-entrancy / deadlock bug with scheduler?
Ok, let me know if it recurs.
[google-appengine] Re: Re-entrancy / deadlock bug with scheduler?
Hi Jon,

Great! everpix-alpha

I still see what appears to be the bug now and then, but I don't have a reproducible case just waiting in a task queue anymore.

On Sep 7, 1:19 pm, Jon McAlister jon...@google.com wrote:

Hi Pol,

I think I have a change that will fix this, which should be available once 1.5.4 is pushed out. What is the app-id here?

On Mon, Sep 5, 2011 at 10:41 AM, Pol i...@pol-online.net wrote:

Hi,

Our app settings are as follows:
- Python + HRD
- Max Idle Instances: ( 2 )
- Min Pending Latency: ( 100ms )

As of right now, there are 3 instances alive.

Without going into too much detail, we have GAE integrated with EC2, on which we run remote image processing tools. The tools are called directly using HTTP GETs from GAE and they return their results as JSON (with gzip content encoding).

There are currently 3 tasks in the processing queue on GAE continuously failing: the urlfetch() calls to the EC2 tool reach the 10 second timeout and bail. What doesn't make sense is that calling the EC2 tool directly using curl from random machines succeeds in less than 1 second. But here's the trick: under certain circumstances, the EC2 tool will call back to GAE (a HEAD request that does a single db.get()) to check if the image has already been processed, and this happens for these 3 stuck tasks.

When calling the EC2 tool from the command line with curl, we get the normal behavior:
- EC2 tool retrieves image from arbitrary URL and computes a hash
- EC2 tool does a HEAD call to GAE passing this hash to see if the image has already been processed
- If yes, return empty JSON
- If no, process image and return full JSON
This takes about 1 second.

The exact same call done from GAE produces this behavior:
- EC2 tool retrieves image from arbitrary URL and computes a hash
- EC2 tool does a HEAD call to GAE passing this hash to see if the image has already been processed
- HEAD call hangs --- RE-ENTRANCY / DEADLOCK BUG in GAE
- urlfetch() from GAE to EC2 reaches the 10 second deadline and aborts -- BREAKS DEADLOCK
- HEAD call finally returns
- EC2 tool completes normally

GAE logs confirm the bug:

HEAD call from EC2 / curl origin:
2011-09-05 10:19:52.502 /api/has_backing?bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703 200 368ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor

HEAD call from EC2 / GAE origin:
2011-09-05 10:20:44.670 /api/has_backing?bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703 200 9712ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
2011-09-05 10:20:44.547 /task/import_photo 500 10348ms 28cpu_ms 8api_cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)

(See how the HEAD /api/has_backing call hangs for almost 10 seconds and only returns *after* /task/import_photo and its urlfetch() call to EC2 has aborted.)

And finally, AppStats confirms that it's not the head() Python execution itself that's hanging:

(1) 2011-09-05 09:16:06.843 HEAD /api/has_backing?bid=3bc4aeb08e01d3ba4bfab3282d2a198984a4fc1fab2ad9d1e8a39ee3cddd14da 200 real=227ms cpu=24ms api=8ms overhead=0ms (1 RPC)
(2) 2011-09-05 09:15:56.422 POST /task/import_photo 500 real=10002ms cpu=33ms api=8ms overhead=0ms (3 RPCs)
(3) 2011-09-05 09:15:49.328 HEAD /api/has_backing?bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703 200 real=297ms cpu=21ms api=8ms overhead=0ms (1 RPC)

This issue is currently 100% reproducible.

- Pol
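For readers trying to follow the deadlock, here is a minimal sketch (not the actual Everpix code) of the two GAE handlers involved, assuming a BackingImage model keyed by the content hash ("bid"), a placeholder EC2 endpoint, and a 404-when-missing convention; the route names follow the logs above:

from google.appengine.api import urlfetch
from google.appengine.ext import db, webapp

EC2_TOOL_URL = 'http://ec2-tool.example.com/process'  # placeholder endpoint


class BackingImage(db.Model):
    # Assumed model: one entity per already-processed image, keyed by its hash.
    pass


class HasBackingHandler(webapp.RequestHandler):
    # HEAD /api/has_backing?bid=<hash> -- a single db.get(), as described above.
    def head(self):
        bid = self.request.get('bid')
        entity = db.get(db.Key.from_path('BackingImage', bid))
        self.response.set_status(200 if entity is not None else 404)


class ImportPhotoHandler(webapp.RequestHandler):
    # POST /task/import_photo -- blocks on urlfetch() to the EC2 tool for up
    # to 10 seconds. Meanwhile the tool's HEAD callback to /api/has_backing
    # must be routed to a *different* instance; if the scheduler sends it to
    # this same (single-threaded) instance, both hang until the deadline.
    def post(self):
        url = '%s?image_url=%s' % (EC2_TOOL_URL, self.request.get('image_url'))
        result = urlfetch.fetch(url, deadline=10)
        if result.status_code != 200:
            self.error(500)  # non-200 makes the task queue retry the task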