[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-03-05 Thread paynen
@Kaan, thank you for continuing to respond. I think it's time to move 
beyond this thread and take action to have the potential issue looked at 
properly.

If you think you've found an issue, you should create a Public Issue 
Tracker thread https://code.google.com/p/googleappengine/issues/list with 
one of the following:

   1. A timeframe on an affected instance, or 
   
   2. An example that can be used to reproduce, or 
   
   3. At the very least, your custom task-monitoring solution, leaving the 
   creation of a massive burst of tasks to the engineer who attempts to 
   reproduce
   
   4. You could even, in theory, request that the thread you create be set 
   to private so you can upload your whole codebase
   
The bottom line is this: no issue can be looked at or worked on, if it is 
claimed by somebody who saw it in the wild, until it's reproducible or at 
least has some kind of trace in logs somewhere. We're happy to look up logs 
on an affected version/module/instance/etc. during a given timeframe, and 
try to determine what happened, if anything indeed happened. Either of 
these will suffice, but anything less is not really valid when reporting a 
bug. Every software project in the galaxy beyond hobbyist-size will have 
these requirements for bug reporting. At present, no viable data has been 
produced on how to observe the behaviour claimed to occur. All we have is 
the fact that a custom implementation of task-counting produced the numbers 
11476 and 11507 as output. I hope you can see the gap between what's 
produced and what's needed.

If nobody who has observed the issue (if indeed there is such an issue to 
observe) and wants to see a fix (I assume this is you) is willing to 
provide a reproducing example, the very least that has been requested is a 
timeframe on an affected instance. All that's needed is to say:
 

 On my project primal-buttress-1337, instance 1 of module devel, version 
 v1, began enqueuing 1,000,000 tasks at [TIMESTAMP], and you can see a log 
 line for each task being added, up until the last task is added at 
 [TIMESTAMP]. The last task was executed at [TIMESTAMP]. The number of 
 Flobbb entities created in the Datastore is only 999,998. Therefore I 
 believe this indicates 2 tasks were enqueued but never executed. There are 
 no records in the logs of any task failures.
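
For illustration, the numbers in a report like the one above could come from a small reconciliation script. This is only a minimal sketch, assuming the app writes one log line per enqueue and one per execution; the log formats, the `reconcile` helper and the task names are all hypothetical, not anything App Engine produces on its own.

```python
import re

# Hypothetical log formats: substitute whatever your app actually writes.
ENQUEUED = re.compile(r"enqueued task (\S+)")
EXECUTED = re.compile(r"executed task (\S+)")


def reconcile(log_lines):
    """Return (enqueued, executed, missing) as sets of task names."""
    enqueued, executed = set(), set()
    for line in log_lines:
        match = ENQUEUED.search(line)
        if match:
            enqueued.add(match.group(1))
            continue
        match = EXECUTED.search(line)
        if match:
            executed.add(match.group(1))
    # Tasks that were enqueued but never show up as executed.
    return enqueued, executed, enqueued - executed


logs = [
    "2015-03-01T18:00:00Z enqueued task job-0001",
    "2015-03-01T18:00:00Z enqueued task job-0002",
    "2015-03-01T18:00:01Z enqueued task job-0003",
    "2015-03-01T18:00:02Z executed task job-0001",
    "2015-03-01T18:00:03Z executed task job-0003",
]
enqueued, executed, missing = reconcile(logs)
print(len(enqueued), len(executed), sorted(missing))
```

The names in `missing`, together with the timestamps of the first and last enqueue, are the kind of data that can be looked up on an affected instance.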


If and when such information is produced in a Public Issue Tracker thread, 
you'll be pleased to see it taken up and worked on, and a solution will 
come in a timeframe appropriate to the severity of the issue. Up to this 
point, no such viable data has been received, hence the deadlocked 
nature of this thread. 

Please proceed to open a Public Issue Tracker thread with the information 
necessary to report an issue.


On Sunday, March 1, 2015 at 1:23:12 PM UTC-5, Kaan Soral wrote:

 I just ran my tasks, in 2 stages

 In the first stage, task sequences executed as expected. I only noticed a 
 single additional incoming task and thought it was a glitch / repetition in 
 the counting logic.
 I also experienced an ndb.delete_multi lock issue in an operation that is 
 not related to the taskqueue but counts/combines some side metrics. This 
 deadlock prevented that operation and caused some data loss, yet the 
 taskqueues were unaffected (this is the first time I experienced this issue).

 

 I just ran another set of task sequences. This time I noticed a 
 significant increase in additionally executed tasks:

 Workers Out: 11476 In: 11507

 31 tasks re-executed. Since these tasks are unique by name, they don't 
 cause task explosions, yet I verified that these additional tasks really 
 executed, because the total number of elements visited increased 
 proportionally.

 (I don't see any task error logs, these issues / executions are all silent)

 

 To sum up, I didn't notice any missing tasks this time around, yet I 
 noticed some invisibly re-executed tasks

 To sum up this entire flurry, the main issue is that tasks re-execute and 
 fail silently. These are all counted as retries, and they might happen 100 
 times and count as 100 retries. Retries should really be counted for user 
 failures, not internal ones.
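
A note on the Out/In totals above: an aggregate difference such as 11507 - 11476 = 31 only gives the net of repeated runs minus missing runs, so per-name counts are more informative. The following is a minimal sketch, assuming the counting routine can record the name of every enqueued task and of every execution; `execution_report` and the sample names are hypothetical.

```python
from collections import Counter


def execution_report(enqueued_names, executed_names):
    """Compare per-name execution counts against the enqueued names."""
    runs = Counter(executed_names)
    missing = sorted(set(enqueued_names) - set(runs))
    repeated = {name: n for name, n in runs.items() if n > 1}
    return {
        "out": len(enqueued_names),        # tasks enqueued
        "in": sum(runs.values()),          # executions observed
        "missing": missing,                # enqueued, never executed
        "repeated": repeated,              # executed more than once
        "extra_runs": sum(n - 1 for n in repeated.values()),
    }


report = execution_report(
    ["t1", "t2", "t3", "t4"],
    ["t1", "t2", "t2", "t3", "t3", "t3"],
)
print(report)
```

In this sample, "in" exceeds "out" by 2 even though one task (t4) never ran, because three extra runs mask the one missing run. A net surplus on its own therefore does not rule out missing tasks.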


-- 
You received this message because you are subscribed to the Google Groups 
Google App Engine group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to google-appengine+unsubscr...@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/google-appengine/057d423c-95f8-4f61-b7e7-d48c68f61c1e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread paynen
Hi @husayt,

Of course, I've been following this thread and understand the issue doesn't 
appear in logs directly. I guess I'm just wondering how you've managed to 
determine that this is happening if there isn't any trace anywhere... If 
there is such a trace, I'd appreciate it if you could get it to us, as it's 
necessary to look into it. That could be an affected timeframe on a given 
instance, some logs demonstrating the issue (@Kaan above mentioned 5xx 
spikes in their logs; you could also try making calls to the REST API to 
demonstrate that X tasks were enqueued but only X - N finished, meaning N 
went missing), or a minimally-reproducing example. Access to the internals 
of GAE is our specialty ;) We just need a little help from you in pointing 
where to look, since there are a lot of internals to look at.
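
One way to leave such a trace is to have every task handler log the task headers that App Engine attaches to push-task requests. The sketch below is only an illustration: the header names are the ones listed in the push queue documentation as far as I know, while `task_trace`, the `task-exec` prefix and the sample values are hypothetical.

```python
# Request headers set by App Engine on push-task requests
# (names taken from the push queue documentation; verify against your runtime).
TASK_HEADERS = (
    "X-AppEngine-QueueName",
    "X-AppEngine-TaskName",
    "X-AppEngine-TaskRetryCount",
    "X-AppEngine-TaskExecutionCount",
)


def task_trace(headers):
    """Build one greppable log line from a mapping of request headers."""
    parts = []
    for header in TASK_HEADERS:
        short = header.split("-")[-1]  # e.g. "TaskName"
        parts.append("%s=%s" % (short, headers.get(header, "?")))
    return "task-exec " + " ".join(parts)


# Hypothetical headers, as a handler might receive them.
sample = {
    "X-AppEngine-QueueName": "default",
    "X-AppEngine-TaskName": "job-0001",
    "X-AppEngine-TaskRetryCount": "2",
    "X-AppEngine-TaskExecutionCount": "0",
}
print(task_trace(sample))
```

Logging this line at the start of each handler gives one record per execution, including retries, which can then be matched against the enqueue records for a given timeframe.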

So, do you have the info required? Again, please note that an affected 
timeframe on a given instance (app id, version + module) is potentially 
enough info. This is a high priority for us if it's a high priority for you.

Regards,

NP

On Thursday, February 26, 2015 at 8:54:02 AM UTC-5, husayt wrote:

 Hi @paynen,
 this is the problem. It's almost impossible to replicate externally, as 
 it happens somewhere in the internal App Engine stack.

 And the main problem, as Kaan also explained, is that it never hits the logs.

 So there is not much we can do here as GAE users. This can be replicated 
 only with access to the internals of GAE.

 Can I also stress that this is the number one issue on my list. I had a 
 support case created and it didn't go forward because I couldn't replicate 
 the problem.

 One thing I can say is that it is more likely to happen when we have bursts 
 of tasks.


 Hope this helps,
 HG
 On Wednesday, February 25, 2015 at 10:52:06 PM UTC, paynen wrote:

 If anybody reading other than OP is also affected by this and can provide 
 minimally a reproducing example or an affected timeframe on a given 
 instance, this will be the minimum information needed to look into a 
 potential issue. 

 I'm continuing to monitor this thread, and I hope we can get this 
 addressed as soon as possible, as soon as it's demonstrated/repro'd.

 On Monday, February 23, 2015 at 6:46:49 PM UTC-5, Kaan Soral wrote:

   rate: 500/s
   bucket_size: 100
   retry_parameters:
     task_retry_limit: 6
     min_backoff_seconds: 2
     max_doublings: 3


 Although my queue configuration is broad enough to handle occasional 
 internal failures, I noticed and verified that the taskqueue leaves some 
 tasks unexecuted
 (1% to 10%; it happens when you burst tasks / run a [custom] mapreduce job, 
 both with normal instances and with basic_scaling/B4 instances).
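
To make the retry parameters quoted above concrete, here is a rough sketch of the retry intervals they imply. It assumes the documented backoff behaviour as I understand it (the interval doubles `max_doublings` times, then grows linearly by 2^max_doublings * min_backoff_seconds, capped at the maximum backoff); the `retry_intervals` helper and the 3600-second default cap are assumptions for illustration.

```python
def retry_intervals(task_retry_limit, min_backoff_seconds, max_doublings,
                    max_backoff_seconds=3600.0):
    """Approximate wait (in seconds) before each retry of a failing task."""
    intervals = []
    interval = float(min_backoff_seconds)
    # After the doublings are used up, the interval grows by this constant.
    linear_step = (2 ** max_doublings) * min_backoff_seconds
    for attempt in range(task_retry_limit):
        intervals.append(min(interval, max_backoff_seconds))
        if attempt < max_doublings:
            interval *= 2
        else:
            interval += linear_step
    return intervals


schedule = retry_intervals(task_retry_limit=6, min_backoff_seconds=2,
                           max_doublings=3)
print(schedule, sum(schedule))
```

Under these assumptions the six retries are spaced 2, 4, 8, 16, 32 and 48 seconds apart, so a task that keeps failing would exhaust its retries roughly 110 seconds after the first failure, not counting execution time.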

 I first noticed the issue when some operations that should have been done 
 were left undone.

 Then I inspected the taskqueue execution with a custom routine that 
 tracks / counts ingoing and executing tasks, a routine that I perfected 
 long ago, and noticed the missing executions.

 The issue isn't persistent: after a re-deployment and re-test, the same 
 routine managed to traverse all the entities as it's supposed to.

 TL;DR - some taskqueue tasks silently fail to execute. This should never 
 happen, but it happens very frequently without any reason and causes damage 
 and confusion.





[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread paynen
In addition, re: anybody experiencing this issue, it would be very helpful 
to include your queue config file, so we can check whether you're 
specifying any kind of max retries parameter.






[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-25 Thread paynen
If anybody reading other than OP is also affected by this and can provide 
minimally a reproducing example or an affected timeframe on a given 
instance, this will be the minimum information needed to look into a 
potential issue. 

I'm continuing to monitor this thread, and I hope we can get this addressed 
as soon as possible, as soon as it's demonstrated/repro'd.

On Monday, February 23, 2015 at 6:46:49 PM UTC-5, Kaan Soral wrote:

   rate: 500/s
   bucket_size: 100
   retry_parameters:
     task_retry_limit: 6
     min_backoff_seconds: 2
     max_doublings: 3


 Although my queue configuration is broad enough to handle occasional 
 internal failures, I noticed and verified that the taskqueue leaves some 
 tasks unexecuted
 (1% to 10%; it happens when you burst tasks / run a [custom] mapreduce job, 
 both with normal instances and with basic_scaling/B4 instances).

 I first noticed the issue when some operations that should have been done 
 were left undone.

 Then I inspected the taskqueue execution with a custom routine that tracks 
 / counts ingoing and executing tasks, a routine that I perfected long ago, 
 and noticed the missing executions.

 The issue isn't persistent: after a re-deployment and re-test, the same 
 routine managed to traverse all the entities as it's supposed to.

 TL;DR - some taskqueue tasks silently fail to execute. This should never 
 happen, but it happens very frequently without any reason and causes damage 
 and confusion.




[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-23 Thread paynen
Hey Kaan,

That sounds like a serious problem. Is it possible for you to share some 
logs that demonstrate the issue occurring, or better yet a 
minimally-reproducing example? This is vital to identifying and fixing a 
possible issue on the platform.

Secondly, could you possibly point to the issue ticket # or forum thread 
where the discussion with the engineer occurred? I'm interested to see the 
full context of whether this is a known issue or something else.

Regards,

paynen

On Monday, February 23, 2015 at 6:49:57 PM UTC-5, Kaan Soral wrote:

 I also experienced this same issue previously. An engineer told me that 
 internal/silent/invisible execution failures also count as task retries, 
 which is illogical and likely a design/logic fault.
 - That's why I increased my retry parameters after that 
 issue/research-flurry, and I didn't experience task loss up until now. Now, 
 even with the increased retry parameters, I'm experiencing this issue on 
 multiple instance/scaling configurations. It's killing me slowly.

