
[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-03-05 Thread paynen

@Kaan, thank you for continuing to respond. I think it's time to move
beyond this thread and take action to have the potential issue looked at
properly.

If you think you've found an issue, you should create a Public Issue
Tracker thread https://code.google.com/p/googleappengine/issues/list with
at least one of the following:

1. A timeframe on an affected instance, or

2. An example that can be used to reproduce the issue, or

3. At the very least, your custom task-monitoring solution, leaving the
creation of a massive burst of tasks to the engineer who attempts to
reproduce it.

4. You could even, in theory, request that the thread you create be set to
private so you can upload your whole codebase.

The bottom line is this: no issue can be looked at or worked on, if it is
claimed by somebody who saw it in the wild, until it's reproducible or at
least has some kind of trace in logs somewhere. We're happy to look up logs
on an affected version/module/instance/etc. during a given timeframe, and
try to determine what happened, if anything indeed happened. Either of
these will suffice, but anything less is not really valid when reporting a
bug. Every software project in the galaxy beyond hobbyist-size will have
these requirements for bug reporting. At present, no viable data has been
produced on how to observe the behaviour claimed to occur. All we have is
the fact that a custom implementation of task-counting produced the numbers
11476 and 11507 as output. I hope you can see the gap between what's
produced and what's needed.

If nobody who has observed the issue (if indeed there is such an issue to
observe) and wants to see a fix (I assume this is you) is willing to provide
a reproducing example, the minimum information requested is a timeframe on
an affected instance. All that's needed is to say:

On my project primal-buttress-1337, instance 1 of module devel, version
v1, began enqueuing 1,000,000 tasks at [TIMESTAMP], and you can see a log
line for each task being added, up until the last task is added at
[TIMESTAMP]. The last task was executed at [TIMESTAMP]. The number of
Flobbb entities created in the Datastore is only 999,998. Therefore I
believe this indicates 2 tasks were enqueued but never executed. There are
no records in the logs of any task failures.
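
A report like the one above boils down to comparing what was enqueued with what ran. The following is only a minimal sketch of that comparison, assuming the app logs one task name per enqueue and one per execution; the function name `reconcile` and the sample task names are hypothetical and not part of any App Engine API.

```python
from collections import Counter


def reconcile(enqueued, executed):
    """Compare task names logged at enqueue time with names logged at run time.

    Returns (missing, repeated): names that never ran, and a dict mapping
    each name that ran more than once to its execution count.
    """
    runs = Counter(executed)
    # Enqueued but with no execution record at all.
    missing = sorted(name for name in set(enqueued) if runs[name] == 0)
    # Executed more than once (silent re-executions would show up here).
    repeated = {name: count for name, count in runs.items() if count > 1}
    return missing, repeated


if __name__ == "__main__":
    # Hypothetical log extracts: five tasks enqueued, one ran twice, two never ran.
    enqueued = ["task-%d" % i for i in range(5)]
    executed = ["task-0", "task-1", "task-1", "task-3"]
    missing, repeated = reconcile(enqueued, executed)
    print("never executed:", missing)
    print("executed more than once:", repeated)
```

Attaching the two name lists (or the log lines they came from) together with the timestamps gives the engineer a concrete place to start looking.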

If and when such information is produced in a Public Issue Tracker thread,
you'll be pleased to see it taken up, worked on, and a solution will come
in a course of time appropriate to the severity of the issue. Up to this
point, such viable data has not been received, and hence the deadlocked
nature of this thread.

Please proceed to open a Public Issue Tracker thread with the information
necessary to report an issue.

On Sunday, March 1, 2015 at 1:23:12 PM UTC-5, Kaan Soral wrote:

I just ran my tasks, in 2 stages

In the first stage, task sequences executed as expected, I only noticed a
single additional incoming task, thought it was a glitch / repetition in
the counting logic
I also experienced an ndb.delete_multi lock issue that is not related to
the taskqueue (the call is part of the logic that counts/combines some side
metrics). This deadlock prevented that operation and caused some data loss,
yet the taskqueues were unaffected (first time I experienced this issue)

I just ran another set of task sequences, this time I noticed a
significant increase in additionally executed tasks

Workers Out: 11476 In: 11507

31 tasks re-executed. Since these tasks are unique by name, they don't
cause task explosions, yet I verified that these additional tasks really
executed, because the total number of elements visited increased
proportionally

(I don't see any task error logs, these issues / executions are all silent)

To sum up, I didn't notice any missing tasks this time around, yet I
noticed some invisibly re-executed tasks

To sum up this entire flurry, the main issue is that tasks re-execute and
fail silently. These are all counted as retries, yet they might happen 100
times and count as 100 retries; these retries should actually be user
failures and not internal ones

--
You received this message because you are subscribed to the Google Groups
Google App Engine group.
To unsubscribe from this group and stop receiving emails from it, send an email
to google-appengine+unsubscr...@googlegroups.com.
To post to this group, send email to google-appengine@googlegroups.com.
Visit this group at http://groups.google.com/group/google-appengine.
To view this discussion on the web visit
https://groups.google.com/d/msgid/google-appengine/057d423c-95f8-4f61-b7e7-d48c68f61c1e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-03-01 Thread Kaan Soral

I just ran my tasks, in 2 stages

In the first stage, task sequences executed as expected, I only noticed a 
single additional incoming task, thought it was a glitch / repetition in 
the counting logic
I also experienced an ndb.delete_multi lock issue that is not related to
the taskqueue (the call is part of the logic that counts/combines some side
metrics). This deadlock prevented that operation and caused some data loss,
yet the taskqueues were unaffected (first time I experienced this issue)



I just ran another set of task sequences, this time I noticed a significant 
increase in additionally executed tasks

Workers Out: 11476 In: 11507

31 tasks re-executed. Since these tasks are unique by name, they don't
cause task explosions, yet I verified that these additional tasks really
executed, because the total number of elements visited increased
proportionally

(I don't see any task error logs, these issues / executions are all silent)



To sum up, I didn't notice any missing tasks this time around, yet I 
noticed some invisibly re-executed tasks

To sum up this entire flurry, the main issue is that tasks re-execute and
fail silently. These are all counted as retries, yet they might happen 100
times and count as 100 retries; these retries should actually be user
failures and not internal ones
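
Since the re-executions described here are silent, one common defence is to make each task safe to run more than once. What follows is only a rough sketch of that idea, not the poster's code: it keys completed work on the unique task name mentioned above, and the class `OnceOnlyRunner` and its in-memory set are hypothetical stand-ins for a durable record (for example, an entity written in a transaction).

```python
class OnceOnlyRunner(object):
    """Runs the work for a given task name at most once.

    The in-memory set is an assumption made for illustration; a real app
    would need a durable store shared by all instances.
    """

    def __init__(self):
        self.done = set()   # names of tasks whose work has completed
        self.skipped = 0    # how many repeat executions were ignored

    def run(self, task_name, work):
        if task_name in self.done:
            # A repeat delivery of a task that already finished: count it, do nothing.
            self.skipped += 1
            return False
        work()
        # Mark the task done only after the work succeeded, so a failed
        # attempt can still be retried.
        self.done.add(task_name)
        return True


if __name__ == "__main__":
    visited = []
    runner = OnceOnlyRunner()
    for name in ["worker-1", "worker-2", "worker-1"]:
        runner.run(name, lambda n=name: visited.append(n))
    print("work performed for:", visited)
    print("repeat executions ignored:", runner.skipped)
```

The `skipped` counter doubles as evidence: a non-zero value with no matching error logs is exactly the kind of trace that can be attached to an issue report.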


[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-27 Thread husayt

For me, the way I noticed it was by chance; it also seems it has happened
many times unnoticed.
I was running a task for each namespace and each task would send an email.
I noticed I received fewer emails than the number of namespaces. I looked
at the logs and couldn't find any errors, nor even any evidence of these
tasks ever executing; I am also sure they were sent to the queue. As for the
max retries param, I believe I had a max set to 2-3.

Thanks

On Friday, February 27, 2015 at 12:02:07 AM UTC, paynen wrote:

In addition, re: anybody experiencing this issue, it would be very helpful
to include your queue config file, to make sure whether you're specifying
any kind of max retries parameter.

On Thursday, February 26, 2015 at 8:54:02 AM UTC-5, husayt wrote:

Hi @paynen,
this is the problem. It's almost impossible to replicate externally, as
it happens somewhere in the internal App Engine stack.

And the main problem, as also explained by Kaan, is that it never hits the
logs.

So there is not much we can do here as GAE users. This can be replicated
only with access to internals of GAE.

Can I also stress that this is the number one issue on my list. I had a
support case created and it didn't go forward because I couldn't replicate
the problem.

One thing I can say is that it is more likely to happen when we have bursts
of tasks.

Hope this helps,
HG
On Wednesday, February 25, 2015 at 10:52:06 PM UTC, paynen wrote:

If anybody reading other than OP is also affected by this and can
provide minimally a reproducing example or an affected timeframe on a given
instance, this will be the minimum information needed to look into a
potential issue.

I'm continuing to monitor this thread, and I hope we can get this
addressed as soon as possible, as soon as it's demonstrated/repro'd.

On Monday, February 23, 2015 at 6:46:49 PM UTC-5, Kaan Soral wrote:

rate: 500/s
bucket_size: 100
retry_parameters:
  task_retry_limit: 6
  min_backoff_seconds: 2
  max_doublings: 3
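
For context, the retry settings above imply a specific backoff schedule. The sketch below works it out under two stated assumptions: that the interval follows the documented queue rule (start at min_backoff_seconds, double max_doublings times, then grow linearly by 2**max_doublings * min_backoff_seconds, capped at max_backoff_seconds), and that max_backoff_seconds is left at a default of 3600, since the configuration does not set it. The function name `retry_intervals` is hypothetical.

```python
def retry_intervals(min_backoff, max_doublings, retries, max_backoff=3600):
    """Return the wait in seconds before each retry, assuming the documented
    backoff rule: double max_doublings times, then increase linearly."""
    intervals = []
    interval = min_backoff
    # Constant step used once the doubling phase is over.
    step = min_backoff * 2 ** max_doublings
    for attempt in range(retries):
        intervals.append(min(interval, max_backoff))
        if attempt < max_doublings:
            interval *= 2
        else:
            interval += step
    return intervals


if __name__ == "__main__":
    # Values taken from the queue configuration quoted above.
    schedule = retry_intervals(min_backoff=2, max_doublings=3, retries=6)
    print(schedule)       # [2, 4, 8, 16, 32, 48]
    print(sum(schedule))  # 110
```

Under these assumptions, a task that keeps failing would exhaust its six retries in under two minutes.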

Although my queue configuration is broad enough to handle occasional
internal failures, I noticed and verified that the taskqueue leaves some
tasks unexecuted
( 1% to 10%, happens when you burst tasks / run a mapreduce job
[custom] - happens both with normal instances and basic_scaling/B4
instances )

I first noticed the issue when some operations that should have been done
were left undone

Then I inspected the taskqueue execution with a custom routine that
tracks / counts ingoing and executing tasks, a routine that I perfected
long ago, and noticed the missing executions

The issue isn't persistent, after a re-deployment and re-test, the same
routine managed to traverse all the entities as it's supposed to

TL;DR - some taskqueue tasks silently fail to execute, this should
never happen, but it happens very frequently without any reason, causes
damage and confusion


[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread husayt

Hi @paynen,
this is the problem. It's almost impossible to replicate externally, as
it happens somewhere in the internal App Engine stack.

And the main problem, as also explained by Kaan, is that it never hits the
logs.

So there is not much we can do here as GAE users. This can be replicated
only with access to internals of GAE.

Can I also stress that this is the number one issue on my list. I had a
support case created and it didn't go forward because I couldn't replicate
the problem.

One thing I can say is that it is more likely to happen when we have bursts
of tasks.

Hope this helps,
HG
On Wednesday, February 25, 2015 at 10:52:06 PM UTC, paynen wrote:

If anybody reading other than OP is also affected by this and can provide
minimally a reproducing example or an affected timeframe on a given
instance, this will be the minimum information needed to look into a
potential issue.

I'm continuing to monitor this thread, and I hope we can get this
addressed as soon as possible, as soon as it's demonstrated/repro'd.

On Monday, February 23, 2015 at 6:46:49 PM UTC-5, Kaan Soral wrote:

rate: 500/s
bucket_size: 100
retry_parameters:
  task_retry_limit: 6
  min_backoff_seconds: 2
  max_doublings: 3

Although my queue configuration is broad enough to handle occasional
internal failures, I noticed and verified that the taskqueue leaves some
tasks unexecuted
( 1% to 10%, happens when you burst tasks / run a mapreduce job [custom]
- happens both with normal instances and basic_scaling/B4 instances )

I first noticed the issue when some operations that should have been done
were left undone

Then I inspected the taskqueue execution with a custom routine that
tracks / counts ingoing and executing tasks, a routine that I perfected
long ago, and noticed the missing executions

The issue isn't persistent, after a re-deployment and re-test, the same
routine managed to traverse all the entities as it's supposed to

TL;DR - some taskqueue tasks silently fail to execute, this should never
happen, but it happens very frequently without any reason, causes damage
and confusion


[google-appengine] Re: [SEVERE] App Engine can't consistently execute tasks

2015-02-26 Thread paynen

Hi @husayt,

Of course, I've been following this thread and understand the issue doesn't
appear in logs directly. I guess I'm just wondering how you've managed to
determine that this is happening if there isn't any trace anywhere... If
there is such a trace, I'd appreciate if you could get it to us, as it's
necessary to look into it. Whether that's an affected timeframe on a given
instance, some logs demonstrating the issue (@Kaan above mentioned 5xx
spikes in their logs, you could try to also make calls to the REST API to
demonstrate that X tasks were enqueued, but demonstrate only X - N
finished, thus meaning N went missing), or a minimally-reproducing example.
Access to the internals of GAE is our specialty ;) We just need a little
help from you in pointing where to look, since there are a lot of
internals to look at.

So, do you have the info required? Again, please note that an affected
timeframe on a given instance (app id, version + module) is potentially
enough info. This is a high priority for us if it's a high priority for you.

Regards,