Re: [google-appengine] Re: Re-entrancy / deadlock bug with scheduler?

2011-09-12 Thread Jon McAlister
I just made a change to your app's configuration. Could you retry your
repro? Did this help?

Alternatively, is there anything I could do to trigger the repro?

On Sat, Sep 10, 2011 at 11:02 PM, Pol i...@pol-online.net wrote:
 Still happening right now. I can repro pretty easily: schedule ~10
 image processing tasks in a queue, and 1 to 3 of them often fail due
 to that re-entrancy bug. They then retry automatically and usually go
 through after a few attempts.
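
 For concreteness, the repro amounts to enqueuing roughly the following;
 the payload parameter name is made up, only the /task/import_photo
 handler path is taken from the logs below:

 from google.appengine.api import taskqueue

 def schedule_imports(photo_urls):
     # Enqueue ~10 image-processing tasks; each one runs /task/import_photo,
     # which in turn urlfetch()es the EC2 tool.
     for photo_url in photo_urls:
         taskqueue.add(url='/task/import_photo',
                       params={'url': photo_url})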

 5 instances idle / running, no load on the app: only 1 user currently.

 It really looks as if the scheduler, when receiving the callback from
 the EC2 box, tries to serve it on the same instance that is still
 blocked in urlfetch(): there must be a bug where the scheduler thinks
 that instance is not busy, or something along those lines.

 If you look at the distribution of requests, something's clearly wrong
 as 1 instance gets the majority of the traffic.

 QPS*    Latency*   Requests   Errors   Age       Memory        Availability
 0.000   0.0 ms     7          0        0:03:30   17.7 MBytes   Dynamic
 0.000   0.0 ms     152        1        0:26:56   68.6 MBytes   Dynamic
 0.000   0.0 ms     6          0        0:32:22   23.9 MBytes   Dynamic
 0.000   0.0 ms     4          0        0:32:12   24.6 MBytes   Dynamic
 0.000   0.0 ms     3          0        0:32:09   14.5 MBytes   Dynamic

 On Sep 9, 1:43 pm, Jon McAlister jon...@google.com wrote:
 Ok, let me know if it recurs.

 On Wed, Sep 7, 2011 at 3:07 PM, Pol i...@pol-online.net wrote:
  Hi Jon,

  Great!

  everpix-alpha

  I still see what appears to be the bug now and then, but I don't have
  a reproducible case just waiting in a task queue anymore.

  On Sep 7, 1:19 pm, Jon McAlister jon...@google.com wrote:
  Hi Pol,

  I think I have a change that will fix this, which should be available
  once 1.5.4 is pushed out. What is the app-id here?

  On Mon, Sep 5, 2011 at 10:41 AM, Pol i...@pol-online.net wrote:
   Hi,

   Our app settings are as follows:
   - Python + HRD
   - Max Idle Instances: ( 2 )
   - Min Pending Latency: ( 100ms )
   As of right now, there are 3 instances alive.

   Without going too much into details, we have GAE integrated with EC2,
   on which we run remote image processing tools. The tools are called
   directly using HTTP GETs from GAE and they return their results as
   JSON (with gzip content encoding).

   There are currently 3 tasks in the processing queue on GAE
   continuously failing: the urlfetch() calls to the EC2 tool reach the
   10-second timeout and bail. What doesn't make sense is that calling
   the EC2 tool directly using curl from random machines succeeds in
   less than 1 second.
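
   For reference, the call from the task is essentially a plain
   urlfetch() GET with an explicit 10-second deadline; a minimal sketch,
   with the EC2 host and query parameter as placeholders:

   import urllib
   from google.appengine.api import urlfetch

   EC2_TOOL_URL = 'http://ec2-tool.example.com/process'  # placeholder host

   def call_ec2_tool(image_url):
       # Plain HTTP GET to the EC2 tool; it replies with JSON
       # (gzip content encoding on the wire).
       url = EC2_TOOL_URL + '?url=' + urllib.quote(image_url, '')
       result = urlfetch.fetch(url, method=urlfetch.GET, deadline=10)
       return result.content  # JSON body from the tool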

   But here's the trick: under certain circumstances, the EC2 tool will
   call back into GAE (a HEAD request that does a single db.get()) to
   check if the image has already been processed, and this happens for
   these 3 stuck tasks.
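
   That callback endpoint is tiny; schematically it looks like the sketch
   below (the model and the response header are made up; only the
   /api/has_backing path, the bid parameter and the single db.get() are
   from this thread):

   from google.appengine.ext import db, webapp

   class Backing(db.Model):
       # Hypothetical model; the key_name is the image hash ('bid').
       created = db.DateTimeProperty(auto_now_add=True)

   class HasBackingHandler(webapp.RequestHandler):
       def head(self):
           # Single db.get() keyed on the hash passed as 'bid'.
           bid = self.request.get('bid')
           entity = db.get(db.Key.from_path('Backing', bid))
           # How yes/no is conveyed isn't shown in the thread; a custom
           # response header is assumed here.
           self.response.headers['X-Has-Backing'] = '1' if entity else '0'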

   When calling the EC2 tool from the command line with curl, we get the
   normal behavior:
   - EC2 tool retrieves the image from an arbitrary URL and computes a hash
   - EC2 tool does a HEAD call to GAE, passing this hash, to see if the
   image has already been processed
    - If yes, return empty JSON
    - If no, process the image and return full JSON
   This takes about 1 second.

   The exact same call done from GAE produces this behavior:
   - EC2 tool retrieves the image from an arbitrary URL and computes a hash
   - EC2 tool does a HEAD call to GAE, passing this hash, to see if the
   image has already been processed
    - HEAD call hangs  --- RE-ENTRANCY / DEADLOCK BUG in GAE
    - urlfetch() from GAE to EC2 reaches its 10-second deadline and
   aborts  -- BREAKS DEADLOCK
    - HEAD call finally returns
   - EC2 tool completes normally
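
   On the GAE side that failure surfaces as the fetch raising a deadline
   error, so /task/import_photo returns 500 and the task queue retries it
   later. Schematically (handler and parameter names are guesses; whether
   the 500 comes from explicit handling or an uncaught exception isn't
   shown, explicit handling is assumed here):

   from google.appengine.api import urlfetch
   from google.appengine.ext import webapp

   class ImportPhotoHandler(webapp.RequestHandler):
       def post(self):
           try:
               # While this fetch is blocked, the EC2 tool calls back into
               # /api/has_backing, and that HEAD request hangs until the
               # deadline below fires.
               result = urlfetch.fetch(self.request.get('url'),
                                       method=urlfetch.GET, deadline=10)
           except urlfetch.DownloadError:
               # Deadline reached: answer 500 so the task queue retries later.
               self.error(500)
               return
           # ... store the processing result from result.content ...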

   GAE logs confirm the bug:

   HEAD call from EC2 / curl origin
   2011-09-05 10:19:52.502 /api/has_backing?
   bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
   200 368ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor

   HEAD call from EC2 / GAE origin
   2011-09-05 10:20:44.670 /api/has_backing?
   bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
   200 9712ms 48cpu_ms 8api_cpu_ms 0kb Everpix-Processor
   2011-09-05 10:20:44.547 /task/import_photo 500 10348ms 28cpu_ms
   8api_cpu_ms 0kb AppEngine-Google; (+http://code.google.com/appengine)
   (see how the HEAD /api/has_backing call hangs for almost 10 seconds
   and only returns *after* /task/import_photo and its urlfetch() call to
   EC2 have aborted)

   And finally, AppStats confirms that it's not the head() Python
   execution itself that's hanging:

   (1) 2011-09-05 09:16:06.843 HEAD /api/has_backing?
   bid=3bc4aeb08e01d3ba4bfab3282d2a198984a4fc1fab2ad9d1e8a39ee3cddd14da
   200 real=227ms cpu=24ms api=8ms overhead=0ms (1 RPC)
   (2) 2011-09-05 09:15:56.422 POST /task/import_photo 500 real=10002ms
   cpu=33ms api=8ms overhead=0ms (3 RPCs)
   (3) 2011-09-05 09:15:49.328 HEAD /api/has_backing?
   bid=90e794f348ac76520076f5d02bc67c87c8a9185b8d36affe8377e73fe4820703
   200 real=297ms cpu=21ms api=8ms overhead=0ms (1 RPC)

   This issue is currently 100% reproducible.

   - Pol

[google-appengine] Re: Re-entrancy / deadlock bug with scheduler?

2011-09-11 Thread Pol
Still happening right now. I can repro pretty easily: schedule ~10
image processing tasks in a queue, and 1 to 3 of them often fail due
to that re-entrancy bug. They then retry automatically and usually go
through after a few attempts.

5 instances idle / running, no load on the app: only 1 user currently.

It really looks as if the scheduler, when receiving the callback from
the EC2 box, tries to serve it on the same instance that is still
blocked in urlfetch(): there must be a bug where the scheduler thinks
that instance is not busy, or something along those lines.

If you look at the distribution of requests, something's clearly wrong
as 1 instance gets the majority of the traffic.

QPS*    Latency*   Requests   Errors   Age       Memory        Availability
0.000   0.0 ms     7          0        0:03:30   17.7 MBytes   Dynamic
0.000   0.0 ms     152        1        0:26:56   68.6 MBytes   Dynamic
0.000   0.0 ms     6          0        0:32:22   23.9 MBytes   Dynamic
0.000   0.0 ms     4          0        0:32:12   24.6 MBytes   Dynamic
0.000   0.0 ms     3          0        0:32:09   14.5 MBytes   Dynamic

Re: [google-appengine] Re: Re-entrancy / deadlock bug with scheduler?

2011-09-09 Thread Jon McAlister
Ok, let me know if it recurs.

[google-appengine] Re: Re-entrancy / deadlock bug with scheduler?

2011-09-07 Thread Pol
Hi Jon,

Great!

everpix-alpha

I still see what appears to be the bug now and then, but I don't have
a reproducible case just waiting in a task queue anymore.
