[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-03-05 Thread paynen
@Kaan, thank you for continuing to respond. I think it's time to move 
beyond this thread and take action to have the potential issue looked at 
properly.

If you think you've found an issue, you should create a Public Issue 
Tracker thread https://code.google.com/p/googleappengine/issues/list with 
one of the following:

   1. A timeframe on an affected instance, or 
   
   2. An example that can be used to reproduce, or 
   
   3. At the very least, your custom task monitoring solution, leaving the 
   creation of a massive burst of tasks to the engineer who attempts to 
   reproduce it, or
   
   4. Your whole codebase; in theory, you could even request that the thread 
   you create be set to private so you can upload it
   
The bottom line is this: an issue that somebody reports having seen in the 
wild cannot be looked at or worked on until it's reproducible or at 
least has some kind of trace in logs somewhere. We're happy to look up logs 
on an affected version/module/instance/etc. during a given timeframe, and 
try to determine what happened, if anything indeed happened. Either of 
these will suffice, but anything less is not really valid when reporting a 
bug. Every software project in the galaxy beyond hobbyist-size will have 
these requirements for bug reporting. At present, no viable data has been 
produced on how to observe the behaviour claimed to occur. All we have is 
the fact that a custom implementation of task-counting produced the numbers 
11476 and 11507 as output. I hope you can see the gap between what's 
produced and what's needed.

If nobody who has observed the issue (if indeed there is such an issue to 
observe) and wants to see a fix (I assume this is you) can point 
to a reproducing example, the minimum information requested is a 
timeframe on an affected instance. All that's needed is to say:
 

 On my project primal-buttress-1337, instance 1 of module devel, version 
 v1, began enqueuing 1,000,000 tasks at [TIMESTAMP], and you can see a log 
 line for each task being added, up until the last task is added at 
 [TIMESTAMP]. The last task was executed at [TIMESTAMP]. The number of 
 Flobbb entities created in the Datastore is only 999,998. Therefore I 
 believe this indicates 2 tasks were enqueued but never executed. There are 
 no records in the logs of any task failures.
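
As a rough illustration of how the figures in a report like the one above 
might be gathered, here is a minimal Python sketch. It assumes the 
application records one event per enqueue and one per execution; the Event 
record, its fields, and summarize() are hypothetical names, not part of any 
App Engine API.

```python
from collections import namedtuple

# Hypothetical record written by the application itself:
# kind is "enqueue" or "execute", ts is a Unix timestamp.
Event = namedtuple("Event", ["kind", "task_name", "ts"])


def summarize(events):
    """Reduce recorded task events to the figures an issue report needs."""
    enqueued = [e for e in events if e.kind == "enqueue"]
    executed = [e for e in events if e.kind == "execute"]
    enqueued_names = {e.task_name for e in enqueued}
    executed_names = {e.task_name for e in executed}
    return {
        "first_enqueue_ts": min((e.ts for e in enqueued), default=None),
        "last_enqueue_ts": max((e.ts for e in enqueued), default=None),
        "last_execute_ts": max((e.ts for e in executed), default=None),
        "enqueued": len(enqueued_names),
        "executed": len(executed_names),
        # Names that were handed to the queue but never seen by a worker.
        "never_executed": sorted(enqueued_names - executed_names),
    }


# Illustrative data only: three tasks enqueued, one never executed.
events = [
    Event("enqueue", "t1", 100),
    Event("enqueue", "t2", 101),
    Event("enqueue", "t3", 102),
    Event("execute", "t1", 110),
    Event("execute", "t2", 115),
]
report = summarize(events)
```

The three timestamps and the never_executed list correspond to the 
[TIMESTAMP] placeholders and the missing-task count in the example report.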


If and when such information is produced in a Public Issue Tracker thread, 
you'll be pleased to see it taken up, worked on, and a solution will come 
in a course of time appropriate to the severity of the issue. Up to this 
point, such viable data has not been received, and hence the deadlocked 
nature of this thread. 

Please proceed to open a Public Issue Tracker thread with the information 
necessary to report an issue.



-- 
You received this message because you are subscribed to the Google Groups 
Google App Engine group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to google-appengine+unsubscr...@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-appengine/057d423c-95f8-4f61-b7e7-d48c68f61c1e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-03-01 Thread Kaan Soral
I just ran my tasks, in 2 stages.

In the first stage, the task sequences executed as expected. I only noticed 
a single additional incoming task and thought it was a glitch / repetition 
in the counting logic.
I also experienced an ndb.delete_multi lock issue that is not related to the 
taskqueue; the affected operation counts/combines some side metrics. This 
deadlock prevented that operation and caused some data loss, yet the 
taskqueues were unaffected (this is the first time I have experienced this 
issue).



I just ran another set of task sequences, and this time I noticed a 
significant increase in additionally executed tasks:

Workers Out: 11476 In: 11507

31 tasks re-executed. Since these tasks are unique by name, they don't 
cause task explosions, yet I verified that these additional tasks really 
executed, because the total number of elements visited increased 
proportionally.

(I don't see any task error logs, these issues / executions are all silent)
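
The Out/In comparison above can be sketched roughly as follows. This is not 
my actual counting routine, only a minimal Python illustration; reconcile() 
is a hypothetical helper, and the assumption that the 31 extra executions 
were 31 distinct tasks each running twice is made up for the example.

```python
from collections import Counter


def reconcile(enqueued_names, executed_names):
    """Compare task names handed to the queue with names seen by workers."""
    runs = Counter(executed_names)
    # Enqueued but never executed.
    missing = sorted(set(enqueued_names) - set(runs))
    # Executed more than once, with the number of runs per task.
    repeated = {name: n for name, n in runs.items() if n > 1}
    # Total executions beyond the first for each task.
    extra_runs = sum(n - 1 for n in repeated.values())
    return missing, repeated, extra_runs


# Illustrative data shaped like the counters above: 11476 out, 11507 in.
out_names = ["task-%d" % i for i in range(11476)]
in_names = out_names + out_names[:31]  # assumed: 31 tasks ran twice
missing, repeated, extra_runs = reconcile(out_names, in_names)
```

Keeping the names, and not only the totals, is what makes it possible to say 
which tasks ran twice and when.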



To sum up, I didn't notice any missing tasks this time around, yet I 
noticed some invisibly re-executed tasks.

To sum up this entire flurry, the main issue is that tasks re-execute or 
fail silently, and these are all counted as retries. This might happen 100 
times and count as 100 retries; the retry count should really track user 
failures and not internal ones.



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-27 Thread husayt
For me, I noticed it by chance, and it seems it had happened many 
times unnoticed.
I was running a task for each namespace, and each task would send an email. 
I noticed I received fewer emails than the number of namespaces. I looked at 
the logs and couldn't find any errors, nor even any evidence of these 
tasks ever executing, although I am sure they were sent to the queue. As for 
the max retries param, I believe I had a max set to 2-3.

Thanks





[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread husayt
Hi @paynen,
this is the problem. It's almost impossible to replicate externally, as 
it happens somewhere in the internal App Engine stack.

And the main problem, as Kaan also explained, is that it never hits the logs.

So there is not much we can do here as GAE users. This can be replicated 
only with access to the internals of GAE.

Can I also stress that this is the number one issue on my list. I had a 
support case created and it didn't go forward because I couldn't replicate 
the problem.

One thing I can say is that it is more likely to happen when we have bursts 
of tasks.


Hope this helps,
HG



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread paynen
Hi @husayt,

Of course, I've been following this thread and understand the issue doesn't 
appear in logs directly. I guess I'm just wondering how you've managed to 
determine that this is happening if there isn't any trace anywhere... If 
there is such a trace, I'd appreciate it if you could get it to us, as it's 
necessary to look into it. That could be an affected timeframe on a given 
instance, some logs demonstrating the issue, or a minimally-reproducing 
example. (@Kaan above mentioned 5xx spikes in their logs; you could also try 
making calls to the REST API to demonstrate that X tasks were enqueued but 
only X - N finished, meaning N went missing.) 
Access to the internals of GAE is our specialty ;) We just need a little 
help from you in pointing where to look, since there are a lot of 
internals to look at.

So, do you have the info required? Again, please note that an affected 
timeframe on a given instance (app id, version + module) is potentially 
enough info. This is a high priority for us if it's a high priority for you.

Regards,

NP




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread Kaan Soral
I have a large mapping task I need to run soon; I will report in more 
detail if I detect the issue during that execution.

I can't try to reproduce the issue, mainly because I don't have much to 
spare, both time- and money-wise. However, I will help as the issue 
arises, and I really appreciate your help and interest in the issue.

I would like to humbly suggest a general method I use a lot instead:

1) Implement a task burst routine that bursts tasks with different 
configurations (names, no names, repetitions, loads, no loads) and runs them 
on different retry queues
(the tasks can be error-free / simple tasks, as task errors are not the 
issue here)
2) Implement a counting routine / a routine to make sure the tasks executed 
as they should
3) Implement extreme scenarios, for example natural instance deaths etc. 
that happen during these task bursts
4) Run this routine at different times, on different natural production 
environments (it seems to me that there are good environments and 
bad ones, and we get a random one each time; this is just an 
impression)
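
Steps 1 and 2 can be sketched roughly like this. It is a minimal, local 
Python illustration and not my production routine: FlakyLocalQueue is a 
hypothetical in-memory stand-in that only simulates silent loss, and on App 
Engine the enqueue callable would wrap the real taskqueue call while the 
worker would be a request handler.

```python
import random


class FlakyLocalQueue:
    """Hypothetical in-memory stand-in for a push queue (NOT App Engine)."""

    def __init__(self, drop_rate=0.0, seed=0):
        self._pending = []
        self._rng = random.Random(seed)
        self._drop_rate = drop_rate

    def enqueue(self, name):
        self._pending.append(name)

    def drain(self, worker):
        # Deliver every pending task, silently dropping a fraction of them.
        while self._pending:
            name = self._pending.pop()
            if self._rng.random() < self._drop_rate:
                continue  # simulated silent loss: no error, no log
            worker(name)


def run_burst(task_names, enqueue, drain):
    """Step 1: burst all tasks; returns how often each task executed."""
    executions = {}

    def worker(name):
        executions[name] = executions.get(name, 0) + 1

    for name in task_names:
        enqueue(name)
    drain(worker)
    return executions


def check(task_names, executions):
    """Step 2: report tasks that never ran and tasks that ran repeatedly."""
    missing = [n for n in task_names if n not in executions]
    repeated = [n for n in task_names if executions.get(n, 0) > 1]
    return missing, repeated


names = ["burst-%d" % i for i in range(1000)]
healthy = FlakyLocalQueue(drop_rate=0.0)
missing, repeated = check(names, run_burst(names, healthy.enqueue, healthy.drain))
```

Because the queue is passed in as two callables, the same burst and check 
code can be pointed at different queues and configurations, which is the 
point of steps 1 and 4.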

(I would like to humbly suggest a simple approach for all critical 
App Engine services, datastore transactions for example. I have noticed 
there is a small possibility that transactions re-execute, but I haven't 
brought it up yet, as it's impossible to track and report; still, my 
internal checks regularly flag routines that re-execute. This is just an 
FYI on another similar App Engine anomaly. The two issues are alike in that 
they both happen during bursts and non-deterministically. A glitchy 
transaction doesn't do much damage, yet a glitchy taskqueue does, as queue 
executions are more vital in most scenarios.)
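
One common way to make re-executed routines harmless is to guard each 
operation with a marker keyed on something unique to it, such as the task 
name. The following is only a minimal in-memory Python sketch of that idea; 
ProcessedMarkers and apply_once are hypothetical names, and a real version 
would have to store the marker durably and atomically with the operation 
(for example in the same datastore transaction), which is not shown here.

```python
class ProcessedMarkers:
    """Hypothetical store of operation keys that were already applied."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True only for the first caller presenting this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def apply_once(markers, key, operation):
    """Run operation unless the same key was already claimed."""
    if not markers.claim(key):
        return False  # a re-execution: skip the side effect
    operation()
    return True


# Illustrative run: the same task name arrives twice, the effect happens once.
visited = []
markers = ProcessedMarkers()
first = apply_once(markers, "task-42", lambda: visited.append("task-42"))
second = apply_once(markers, "task-42", lambda: visited.append("task-42"))
```

In this sketch the marker is claimed before the operation runs, so a failure 
inside the operation would leave the key claimed; that trade-off is part of 
why the durable version needs a transaction.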



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread paynen
In addition, re: anybody experiencing this issue, it would be very helpful 
to include your queue config file, so we can check whether you're specifying 
any kind of max retries parameter.




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-25 Thread paynen
If anybody reading other than OP is also affected by this and can provide 
minimally a reproducing example or an affected timeframe on a given 
instance, this will be the minimum information needed to look into a 
potential issue. 

I'm continuing to monitor this thread, and I hope we can get this addressed 
as soon as possible, as soon as it's demonstrated/repro'd.

On Monday, February 23, 2015 at 6:46:49 PM UTC-5, Kaan Soral wrote:

   rate: 500/s
   bucket_size: 100
   retry_parameters:
     task_retry_limit: 6
     min_backoff_seconds: 2
     max_doublings: 3


 Although my queue configuration is broad enough to handle occasional 
 internal failures, I noticed and verified that the taskqueue leaves some 
 tasks unexecuted
 (1% to 10%; it happens when you burst tasks / run a [custom] mapreduce job, 
 with both normal instances and basic_scaling/B4 instances).

 I first noticed the issue when some operations that should have been done 
 were left undone.

 Then I inspected the taskqueue execution with a custom routine that tracks 
 / counts ingoing and executing tasks, a routine that I perfected long ago, 
 and noticed the missing executions.

 The issue isn't persistent: after a re-deployment and re-test, the same 
 routine managed to traverse all the entities as it's supposed to.

 TL;DR - some taskqueue tasks silently fail to execute. This should never 
 happen, but it happens very frequently without any reason and causes damage 
 and confusion.




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread Jason Collins
Hi Kaan,

Just lending support to this:

engineer told me that internal/silent/invisible execution failures also 
 count as task retries


I have also been told this. We have an open-source framework for doing 
workflows based on the task queue, and one of the features we tried to build 
into it was an alerting system for when the final task retry failed (and thus the 
task would go away permanently). We had to give up on the feature because 
it was possible (and we saw it) for tasks to fail before ever hitting our 
code, and this counts as a retry, and if it happened to be on the last 
retry, then our alert code wouldn't get a chance to operate. 

In short, it would have been an unreliable feature. Note, this was a long 
time ago though (3+ years).
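
To make the failure mode concrete, here is a rough Python sketch of that kind 
of final-retry check. It is not the code from our framework. It assumes the 
handler reads the X-AppEngine-TaskRetryCount request header that push task 
requests carry, and that only task_retry_limit bounds the retries; 
is_final_attempt and alerts_fired are hypothetical helpers.

```python
def is_final_attempt(headers, task_retry_limit):
    """True when this request is the last attempt the queue will make.

    Assumes the retry count starts at 0 for the first attempt, so the
    final attempt is the one whose count has reached the limit.
    """
    retry_count = int(headers.get("X-AppEngine-TaskRetryCount", 0))
    return retry_count >= task_retry_limit


def alerts_fired(reached_handler, task_retry_limit):
    """Count alerts over a sequence of attempts.

    reached_handler[i] is False when attempt i failed before ever
    reaching application code, so no handler logic could run.
    """
    fired = 0
    for retry_count, reached in enumerate(reached_handler):
        if not reached:
            continue  # the attempt still consumed a retry, silently
        headers = {"X-AppEngine-TaskRetryCount": str(retry_count)}
        if is_final_attempt(headers, task_retry_limit):
            fired += 1
    return fired


# With a limit of 6 there are 7 attempts (retry counts 0 through 6).
all_reach_code = alerts_fired([True] * 7, task_retry_limit=6)
last_one_lost = alerts_fired([True] * 6 + [False], task_retry_limit=6)
```

When the last attempt never reaches the handler, the check has no chance to 
run, which is exactly why the alert could not be made reliable.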

j

On Monday, 23 February 2015 23:49:44 UTC-6, Kaan Soral wrote:

 If there were any logs, it wouldn't be an issue; I would likely have solved 
 the issue from the logs. The issue is that there are no logs.

 With this recent issue, the only error logs are some instances dying 
 critically from memory overflows (after going over 256 MB; my theory 
 is an ndb memory leak, but that's off-topic).
 (There are 5-6 error logs like these, all instance overflows; since there 
 are tens of thousands of other operations, the error ratio is low.)

 In this specific case, the critical death of instances might be related to 
 the issue: those instances might be taking some tasks with them and leaving 
 the tasks unexecuted (theory).

 I will dig in and find the discussion





[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread Kaan Soral
You can't see these issues from the logs, yet you can see them as 5xx 
spikes from the error *graphs*/dashboard (most of the time)



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread Kaan Soral
Thanks a lot for the support, j and HG, it's great to hear I'm not the only 
one.

Jason, it makes sense to let some queues retry indefinitely. I initially 
had only 1 retry, as I thought retries were for user failures; then I 
moved to my current config, as it seemed extremely unlikely that 6 retries 
with backoffs could be exhausted. However, it seems they can be. Similarly, 
there is a possibility that the issue might also occur with indefinite 
retries. It's still not 100% certain that internal retry failures are the 
cause, although several engineers claimed they were. (I generally try to 
avoid indefinite retries, as they might produce serious side effects.)
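
For a sense of how much time those 6 retries actually cover, here is a small 
Python sketch using the values from my queue config (task_retry_limit: 6, 
min_backoff_seconds: 2, max_doublings: 3). It reflects my reading of the 
documented backoff behaviour, so treat it as an assumption: the delay doubles 
max_doublings times, then grows linearly by 2**max_doublings * 
min_backoff_seconds, capped at max_backoff_seconds (assumed here to default 
to 3600). retry_delays is a hypothetical helper, not an App Engine function.

```python
def retry_delays(task_retry_limit, min_backoff_seconds, max_doublings,
                 max_backoff_seconds=3600):
    """Approximate wait before each retry, under the assumptions above."""
    delays = []
    for attempt in range(task_retry_limit):
        if attempt <= max_doublings:
            # Doubling phase: min, 2*min, 4*min, ...
            delay = min_backoff_seconds * 2 ** attempt
        else:
            # Linear phase: grow by a constant step after the doublings.
            step = min_backoff_seconds * 2 ** max_doublings
            delay = step * (attempt - max_doublings + 1)
        delays.append(min(delay, max_backoff_seconds))
    return delays


delays = retry_delays(task_retry_limit=6, min_backoff_seconds=2,
                      max_doublings=3)
total_window = sum(delays)
```

Under this reading the six retries are spread over roughly 110 seconds in 
total, so an internal problem lasting a couple of minutes during a burst 
could be enough to exhaust them.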

The conversation in which I learned about the retry behaviour was with 
hanssens[t]google.com, titled TaskQueue Followup, on 3/22/14 (he was very 
helpful, thanks).

You can't see these issues from the logs, yet you can see them as 5xx 
spikes from the error logs (most of the time)

(By the way, this specific instance of the issue cost me almost $200. 
I'm constantly running and inspecting routines that trigger many entities, 
which costs a lot. I initially decided to use basic_scaling to prevent 
instance/cost spikes and limit instances, yet basic_scaling somehow made 
the issue worse.)



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread Jason Collins


 basic_scaling made the issue worse somehow


So I suspect that the silent failures occur when the dispatcher is 
attempting to service a dequeued task, but cannot find an available 
instance (nor spin one up) in some unknown period of time.

Just a guess though.

j




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread husayt
Hi Kaan,
I had exactly the same issue (and maybe still have, as there is no easy way 
to detect it). This caused us a huge problem: we were migrating our 
namespaces to a new data schema, and we only found out after some namespaces 
stopped working after the migration.
Upon examination we found that the tasks didn't run for these 
namespaces. There were no error logs, and I don't know how many of these 
missed tasks we had before and after.

This is scary. It happens during bursts, when we run many tasks at once. The 
major problem is the absence of logs.
These failures are difficult to reproduce and even more difficult to catch.

Task queues are offered as a guaranteed task execution solution, but this 
shows they are not.

Best,
HG


On Tuesday, February 24, 2015 at 2:48:34 PM UTC, Jason Collins wrote:

 Hi Kaan,

 Just lending support to this:

 engineer told me that internal/silent/invisible execution failures also 
 count as task retries


 I have also been told this. We have an open-source framework for doing 
 workflows based on the task queue, and one of the features we tried to 
 build into it was an alerting system for when the final task retry failed 
 (and thus the task would go away permanently). We had to give up on the 
 feature because it was possible (and we saw it) for tasks to fail before 
 ever hitting our code. This counts as a retry, and if it happened to be on 
 the last retry, our alert code wouldn't get a chance to operate. 

 In short, it would have been an unreliable feature. Note, this was a long 
 time ago though (3+ years).

 j

 On Monday, 23 February 2015 23:49:44 UTC-6, Kaan Soral wrote:

 If there were any logs, it wouldn't be an issue; I would likely have solved 
 it from the logs. The issue is that there are no logs.

 With this recent issue, the only error logs are from some instances dying 
 critically from memory overflows (after going over 256 MB; my theory is an 
 ndb memory leak, but that's off-topic).
 (There are 5-6 error logs like these, all instance overflows. Since there 
 are tens of thousands of other operations, the error ratio is low.)

 In this specific case, the critical death of instances might be related 
 to the issue: those instances might be taking some tasks with them and 
 leaving the tasks unexecuted (theory).

 I will dig in and find the discussion





[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-24 Thread Jason Collins
If you're looking for guaranteed task execution, you probably don't want to 
specify max retries or max age.

But regardless, the fact (if it is still a fact) that a task with max 
retries might never hit your code before disappearing is still a pretty 
unfortunate thing.
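
To illustrate why that is unfortunate, here is a rough sketch of a handler-side "final attempt" check. It assumes the retry count arrives in the `X-AppEngine-TaskRetryCount` request header and that the queue uses a `task_retry_limit`; the helper names `is_final_attempt` and `alert_fired` are made up for this example. The point is that the check can only run if the attempt actually reaches application code.

```python
# The header name is the one App Engine sets on push-queue requests;
# everything else here is a hypothetical sketch.
RETRY_COUNT_HEADER = "X-AppEngine-TaskRetryCount"


def is_final_attempt(headers, task_retry_limit):
    """Return True if this request looks like the last attempt the task gets.

    Assumes the retry count is 0 on the first attempt and that the task is
    dropped once an attempt with count == task_retry_limit fails.
    """
    retry_count = int(headers.get(RETRY_COUNT_HEADER, 0))
    return retry_count >= task_retry_limit


def alert_fired(reached_handler, task_retry_limit):
    """Simulate a task whose every attempt fails.

    reached_handler[i] says whether attempt i got as far as application code.
    The alert can only fire from inside the handler.
    """
    for retry_count, reached in enumerate(reached_handler):
        if retry_count > task_retry_limit:
            break  # the queue has already given up on the task
        headers = {RETRY_COUNT_HEADER: str(retry_count)}
        if reached and is_final_attempt(headers, task_retry_limit):
            return True
    return False


# Every attempt reaches the handler: the final-attempt alert fires.
print(alert_fired([True, True, True], task_retry_limit=2))   # True
# The last attempt fails before reaching the handler: no alert, task is gone.
print(alert_fired([True, True, False], task_retry_limit=2))  # False
```

Leaving out the retry limit avoids the permanent drop, at the cost of tasks that may retry indefinitely.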

j




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-23 Thread Kaan Soral
If there were any logs, it wouldn't be an issue; I would likely have solved 
it from the logs. The issue is that there are no logs.

With this recent issue, the only error logs are from some instances dying 
critically from memory overflows (after going over 256 MB; my theory is an 
ndb memory leak, but that's off-topic).
(There are 5-6 error logs like these, all instance overflows. Since there 
are tens of thousands of other operations, the error ratio is low.)

In this specific case, the critical death of instances might be related to 
the issue: those instances might be taking some tasks with them and leaving 
the tasks unexecuted (theory).

I will dig in and find the discussion



[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-23 Thread paynen
Hey Kaan,

That sounds like a serious problem. Is it possible for you to share some 
logs that demonstrate the issue occurring, or better yet a 
minimally-reproducing example? This is vital to identifying and fixing a 
possible issue on the platform.

Secondly, could you possibly point to the issue ticket # or forum thread 
where the discussion with the engineer occurred? I'm interested to see the 
full context of whether this is a known issue or something else.

Regards,

paynen




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-23 Thread Kaan Soral
I experienced this same issue previously. An engineer told me that 
internal/silent/invisible execution failures also count as task retries, 
which is illogical and likely a design/logic fault.
That's why I increased my retry parameters after that 
issue/research flurry, and I didn't experience task loss until now. Now, 
even with the increased retry parameters, I'm experiencing this issue on 
multiple instance/scaling configurations. It's killing me slowly.
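
As a rough illustration of why raising the retry limit helps only up to a point: if silent failures count as retries, a task is lost when every one of its attempts fails silently. The sketch below assumes, purely for the sake of the arithmetic, that each attempt fails silently with the same probability and independently of the others; the function name and the numbers are made up.

```python
def loss_probability(p_silent_failure, task_retry_limit):
    """Chance that a task is lost because every attempt fails silently.

    Assumes 1 initial attempt plus task_retry_limit retries, each failing
    independently with probability p_silent_failure (a simplification).
    """
    attempts = task_retry_limit + 1
    return p_silent_failure ** attempts


# Hypothetical 10% silent-failure rate per attempt.
for limit in (0, 3, 10):
    print(limit, loss_probability(0.1, limit))
```

Under this model the loss rate drops quickly as the limit grows. If the failures are clustered instead (for example, during a burst when no instance is available for a stretch of time), the attempts are not independent and a higher limit would help much less.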
