andreyvital opened a new issue, #39717:
URL: https://github.com/apache/airflow/issues/39717
### Apache Airflow version
2.9.1
### If "Other Airflow 2 version" selected, which one?
_No response_
### What happened?
```
[2024-05-20T12:03:24.184+] {task_co
dhwanitdesai commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2223847945
Wondering if this is related to OTEL. We had OTEL enabled in our instance
and ran into this issue. We have disabled it and are testing it out, but it
looks promising. We're hosting
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225593643
> Wondering if this is related to OTEL. We had OTEL enabled in our instance
and ran into this issue. We have disabled it and are testing it out, but it
looks promising. We're hosting Air
howardyoo commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225733492
Not really.
What's currently implemented is the OTEL metrics support, but that would
normally not directly interfere with how Airflow runs. Denn
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225751588
I have this issue on a daily basis.
I can help on tracking this down with some guidance if needed!
viitoare commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225780814
I have a large number of DAGs that depend on some basic DAGs. When I trigger
a large number of DAGs, these DAGs will trigger these basic DAGs through
TriggerDagRunOperator. At thi
dhwanitdesai commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225862760
Another thing we noticed is that we were using some Airflow roles in
access_control which didn't exist on the instance. Airflow wouldn't show this
issue anywhere other than when you
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2225894861
> finished (failed) although the task says it's queued. (Info: None) Was the
task killed externally?
Again, the only way to check it @trlopes1974 and @viitoare is if your tasks
ferruzzi commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226104179
> Might be an interesting clue @ferruzzi @howardyoo -> maybe something rings
a bell here ?
Wow, this one is impressive. Multiple different executors really
complicate the
andrew-stein-sp commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226370708
This is quite the thread...
We really only see the `Airflow Task Timeout Error`s en masse in our
biggest/busiest cluster, and I'm with @potiuk in that I'm not convin
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226843279
> y looks like for some reason something is killing your tasks, and this
will happen for example when you have not enough memory (or other resources
that your tasks need
Giuzzilla commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2229434456
For us, this issue started happening regularly when we upgraded to 2.9.2 (and
was completely resolved after downgrading to 2.9.0). It happened both with
`CeleryExecutor` and `Local
mchaniotakis commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2230194631
I am experiencing the same error. I am not an experienced Airflow user;
however, I am fairly certain this error is not caused by limited resources. In
my case it seems to be
viitoare commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2233593417
Assume that I have 60 DAGs, among which 30 DAGs will trigger another 30 DAGs
via the TriggerDagRunOperator. If I set the parallelism to 32 and trigger these
30 DAGs concurrently,
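A minimal sketch of the pattern described above, with made-up DAG ids: if the trigger tasks wait for the triggered runs to complete, 30 waiting parent tasks can occupy nearly all of a parallelism of 32 and starve the triggered DAGs.
```python
# Hypothetical illustration of the reported setup (DAG ids are made up).
# With wait_for_completion=True each trigger task holds an executor slot
# while it waits, so 30 such tasks can nearly exhaust parallelism=32 before
# the triggered DAGs ever get to run. wait_for_completion is an assumption;
# the comment does not say whether it is set.
import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="parent_dag",
    start_date=datetime.datetime(2024, 1, 1),
    schedule=None,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_basic_dag",
        trigger_dag_id="basic_dag",
        wait_for_completion=True,
    )
```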
niegowic commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2233641002
I think we also have a similar issue and it was mostly visible with
TriggerDagRunOperator
On Wed, 17.07.2024 at 17:26 viitoare ***@***.***>
wrote:
DuanWeiFan commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2233759113
We had this issue after we upgraded to 2.9.2, but we found a fix yesterday.
**TL;DR -
Increasing AIRFLOW__CELERY__WORKER_CONCURRENCY from 16 -> 64 fixed the
issue.**
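For reference, that change expressed as environment configuration (the variable maps to [celery] worker_concurrency in airflow.cfg); 64 is simply the value reported above, not a general recommendation:
```
AIRFLOW__CELERY__WORKER_CONCURRENCY=64
```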
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2233876319
interesting info (or not):
Today we had /var full and some errors showed up on the Postgres DB.
Until we restarted everything: postgres, airflow and right in the en,
ra
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2245721663
humm:
Task Log:
[2024-07-23, 16:45:35 WEST] {standard_task_runner.py:63} INFO - Started
process **4029044** to run task
[2024-07-23, 16:45:35 WEST] {standard_task_r
ndeslandesupgrade commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2261349799
we experienced the exact same issue on Airflow 2.8.4 this morning
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2261448582
Just for reference...
Are you guys using Hashicorp Vault for secrets backend?
I ask this because we were mapping some of these failures to some issues
accessing vault
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2263227474
@potiuk, I have managed to get some debug info on this. Maybe this can help?
So we are talking about the problem where the CELERY task is marked as
failed but the AIR
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2263453311
hummm, this seems tricky.
We have:
grep -i pickle airflow.cfg
**donot_pickle = True**
https://stackoverflow.com/questions/71679921/getting-error-typeerror-ca
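A small sketch for double-checking the effective value, since environment variables such as AIRFLOW__CORE__DONOT_PICKLE take precedence over airflow.cfg:
```python
# Print the effective [core] donot_pickle value as Airflow resolves it,
# rather than grepping airflow.cfg, since env vars override the file.
from airflow.configuration import conf

print(conf.getboolean("core", "donot_pickle"))
```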
NBardelot commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2265502842
> Are you guys using Hashicorp Vault for secrets backend?
We (my team) are.
> I ask this because we were mapping some of these failures to some issues
accessing va
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2265548922
OK. I think we are getting closer to the root cause thanks to that
stacktrace. No pickling is involved here.
What happens there:
1) Mini-scheduler is enabled
2) Duri
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2265685010
@NBardelot
I was talking about the HTTP proxy.
As we have different services that need different proxies, I had
reimplemented part of the Hashicorp API using Python requests wher
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2265693447
> OK. I think we are getting closer to the root cause thanks to that
stacktrace. No airflow pickling is involved here (pickling is done internally by
deepcopy).
>
> What
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2265709965
> Maybe this is related to SSH/SFTP operators? I did find a similar issue
referring to paramiko (used in SSH/SFTP)
Quite likely
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2268357391
> I think a good solution would be @ashb @ephraimbuddy - following what
you've done in https://github.com/apache/airflow/pull/27506 - to just skip
mini-scheduler when something like
ephraimbuddy commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2268381635
> > I think a good solution would be @ashb @ephraimbuddy - following what
you've done in #27506 - to just skip mini-scheduler when something like that
happens. In this case th
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2268443978
> It's a good idea and would solve at least one of the issues that can lead
to that log message. I'm okay with the solution. Other issues can also lead to
the scheduler sending this
scaoupgrade commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269606223
I have been following this thread recently since we also experienced this
issue on airflow `2.8.4`. We have been running on this version for over two
months and this is the fi
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269672650
I tend to agree with @scaoupgrade .
I also believe that there are 2 issues in this thread. I've mentioned it
[here](https://github.com/apache/airflow/issues/39717#issuec
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269784776
@scaoupgrade, @trlopes1974: yes. We actually discussed it a few comments
above in case you missed it:
> > It's a good idea and would solve at least one of the issues that can
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269789609
I can easily close the issue and add simple instructions on what anyone who
sees a similar issue should do (apply the patch and, if they see a similar
issue, report all the details there).
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269795434
I'm a noob, but I can follow instructions.
Unfortunately I only have tomorrow to make it happen as I'm going on
vacation!
On Monday, 5/08/2024, 20:45, Jarek P
scaoupgrade commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269863782
> I'm a noob, but I can follow instructions. Unfortunately I only have
tomorrow to make it happen as I'm going on vacation! On Monday, 5/08/2024,
20:45, Jarek Potiuk ***@***
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2269891549
We did see some cases where the airflow task was marked as failed; you
could see an external_executor_task_id in the task details but the celery
task never appeared in flower
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2270628845
just got one...
Dependency | Reason
-- | --
Dagrun Running | Task instance's dagrun was not in the 'running' state but
in the state 'failed'.
Task Instance State | Task is
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2270649074
#41276 to take care of the "2" scenario, as the "1" has a solution now.
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2270651176
@potiuk, if you guide me, I can deploy the fix to our production env and
see if "1" goes away
potiuk closed issue #39717: Executor reports task instance (...) finished
(failed) although the task says it's queued
URL: https://github.com/apache/airflow/issues/39717
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2271745848
@trlopes1974 -> I see you opened a new issue (cool) - for testing just
apply the patch from #41260 to your installation - that might mean building
your own image with the change applie
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2271789918
Will have to postpone that... vacation mode is on now!
Tried to change taskinstance.py but got an error regarding fab_auth
(or something); had to revert.
nathadfield commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2122652661
I'm not sure there's an Airflow issue here.
My initial thought is that you are experiencing issues related to your
workers and perhaps they are falling over due to resour
andreyvital commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2122946338
Not sure...it seems related to redis? I have seen other people report
similar issues:
- https://github.com/apache/airflow/discussions/37129
- https://github.com/apache
RNHTTR commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2123324113
I think the log you shared
([source](https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L776-L780))
erroneously replaced the "stuck in queued" log some
andreyvital commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2123589993
@RNHTTR there's nothing stating "stuck in queued" in the scheduler logs.
nghilethanh-atherlabs commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2132605772
similar issue here
mikolololoay commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2133266561
I had the same issue when running hundreds of sensors in reschedule mode - a
lot of the time they got stuck in the queued status and raised the same error
that you posted. It
nghilethanh-atherlabs commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2133270480
> I had the same issue when running hundreds of sensors in reschedule mode -
a lot of the time they got stuck in the queued status and raised the same
error that you
andreyvital commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2133534216
Hi @nghilethanh-atherlabs I've been experimenting with those configs as well:
```
#
https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configuratio
seanmuth commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2142765432
Seeing this issue on 2.9.1 as well, also only with sensors.
We've found that the DAG is timing out trying to fill up the Dagbag on the
worker. Even with debug logs enabled I
nghilethanh-atherlabs commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2143313919
@andreyvital thank you so much for your response. I have set it up and it
works really great :)
petervanko commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2143584451
I was working on the issue with @seanmuth, and increasing the parsing time
solved the issue.
It does not fix the root cause, but as a workaround it can save your night...
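The exact setting is cut off above; a plausible candidate, stated here as an assumption, is the DagBag import timeout that bounds how long a worker may spend parsing DAG files:
```
# [core] dagbag_import_timeout defaults to 30 seconds; 120 here is purely
# illustrative, not a value taken from the thread.
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=120
```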
Lee-W commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2148697763
Hello everyone,
I'm currently investigating this issue, but I haven't been able to replicate
it yet. Could you please try setting
`AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTE
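The variable name is truncated above; its full form (it appears verbatim later in the thread) would be set like this - note it trades process isolation for noticeably slower task startup:
```
AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True
```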
niegowic commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2154376103
Spotted the same problem with Airflow 2.9.1 - the problem didn't occur
earlier, so it's strictly related to this version. It happens randomly on
random task executions. Restarting schedul
Lee-W commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2160290930
> Spotted the same problem with Airflow 2.9.1 - the problem didn't occur
earlier, so it's strictly related to this version. It happens randomly on
random task executions. Restarting schedule
NBardelot commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2208307124
We do see a few errors of this kind too, with an Airflow v2.9.2 in
Kubernetes + Celery workers + Redis (AWS Elasticache).
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2208703895
> We do see a few errors of this kind too, with an Airflow v2.9.2 in
Kubernetes + Celery workers + Redis OSS 7.0.7 (AWS Elasticache).
Does it help if you disable "schedule aft
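For anyone wanting to test that suggestion, a sketch of disabling the mini-scheduler (the "schedule after task execution" feature) via configuration:
```
# Equivalent to [scheduler] schedule_after_task_execution = False in airflow.cfg
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False
```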
NBardelot commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2208963332
> Does it help if you disable "schedule after task execution"?
Unfortunately, in our case we rely on the feature for some DAGs with many
sequential tasks, and the tradeoff
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2208995849
> Unfortunately, in our case we rely on the feature for some DAGs with many
sequential tasks, and the tradeoff would not be welcomed by our IT teams
(schedule_after_task_execution w
vizeit commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2210232607
I did a few tests with the new version 2.9.2 and have the following details
from the log
**Configuration**
> Airflow version: 2.9.2
> Compute: GKE
> Executor: CeleryKub
vizeit commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2211493420
> I did a few tests with the new version 2.9.2 and have the following details
from the log
>
> **Configuration**
>
> > Airflow version: 2.9.2
> > Compute: GKE
> > Execu
vizeit commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2211552110
@Lee-W your suggestion of testing with
AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True is not practical
because the processing is 5-6 times slower. I have hundreds of dynam
Lee-W commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2211576613
> @Lee-W your suggestion of testing with
AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True is not practical
because the processing is 5-6 times slower. I have hundreds of dyna
vizeit commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2211611614
@Lee-W [this
line](https://github.com/apache/airflow/blob/2eda7376f4c27df82f0cafba7699bcc46c3fcd05/airflow/providers/celery/executors/celery_executor_utils.py#L150)
may require more
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2211612922
some more info:
Despite that, we do have an external task id.
**Dependencies Blocking Task From Getting Scheduled**
Dependency | Reason
-- | --
Dagrun Running | Task instance
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2213359801
Detailed log sequence of a failing task.
See that the task 'dispatch_restores' was scheduled/queued at 2024-07-07
13:12:03,993 and marked as fa
NBardelot commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2214240138
> Any particularities/findings/correlated logs and events that happen around
the failures then? Just knowing it happens does not bring us any closer to
diagnosing it.
Noth
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2214287441
> But collecting enough positive cases means you know it's probably not
config specific. And it hardly looks like a race condition, because we have the
error occurring several times
NBardelot commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217235758
When I say "configuration specific", I mean it from the point-of-view of it
being specific to the use of Kubernetes, or Redis OSS, or a config value. We
use very straightforward
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217269610
> (and maybe that enters your definition of "config specific").
Yes. The thing is "whatever makes it possible to reproduce the issue". If
the maintainers are not able to repr
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217421431
Humm.
I'm not sure I agree with the "configuration specific" targeting of the
problem/issue/whatever.
It is clear now that this happens with several configurations: Kubernetes
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217444706
> I'm not sure I agree with the "configuration specific" targeting of the
problem/issue/whatever.
Yes. And as I mentioned before with "survivorship bias", you think that it
affects
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217496537
I see 2 different problems in this issue:
> 1 - the task is never executed (it is queued but the scheduler does not
launch it) and this is the case where you have an external_tas
trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2217806992
Well... I'm up for it!
I can (try to) help diagnose it, though my lack of knowledge may limit me
in certain spots.
Maybe we could try to drill down on this one:
viitoare commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2220263998
I also encountered the same problem. During debugging, I found a very
strange phenomenon. In the airflow.cfg file, I set the parallelism to 300, but
when I printed the value of se
potiuk commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2220285045
> I found a very strange phenomenon. In the airflow.cfg file, I set the
parallelism to 300, but when I printed the value of self.parallelism in the
base_executor.py file, it was 32
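A quick sketch for checking which parallelism value the running process actually resolves; 32 is the shipped [core] default, so seeing 32 typically means the edited airflow.cfg is not the file being read, or an environment variable overrides it (that interpretation is an assumption):
```python
# Print the effective [core] parallelism following Airflow's configuration
# precedence (environment variable > airflow.cfg > built-in default of 32).
from airflow.configuration import conf

print(conf.getint("core", "parallelism"))
```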
viitoare commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2220298270
> > I found a very strange phenomenon. In the airflow.cfg file, I set the
parallelism to 300, but when I printed the value of self.parallelism in the
base_executor.py file, it was