Hi,
I wanted to find out if there are people who are using airflow for their
multi-tenant solution.
The way we are using this is -
For a given customer, we have an hourly pipeline (20 tasks) and a daily
pipeline (10 tasks).
This leads to 1 hourly DAG and 1 daily DAG per customer.
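For context, a rough sketch of how per-customer DAGs like this can be generated
from one factory function (all names and the single placeholder task are
illustrative, not our real code):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

def make_dag(customer, schedule):
    # one DAG object per (customer, schedule) pair, with a deterministic id
    dag = DAG(
        dag_id="%s_%s" % (customer, schedule.strip("@")),
        start_date=datetime(2017, 1, 1),
        schedule_interval=schedule)
    # the real pipelines have 20 hourly / 10 daily tasks; one placeholder here
    BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    return dag

# module-level variables so the scheduler's DagBag picks both DAGs up
customer_a_hourly = make_dag("customer_a", "@hourly")
customer_a_daily = make_dag("customer_a", "@daily")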
What I am trying to g
1.8: increasing DAGBAG_IMPORT_TIMEOUT helps; I don't see the issue
(although I am not sure why task progress has become slow, but that's not the
issue we are discussing here, so I am ignoring it; see the cfg snippet below).
1.7: our prod is running 1.7 and we haven't seen the "defunct process"
issue for more than a week n
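For reference, that import timeout lives in the core section of airflow.cfg;
the value below is only an example (if I remember right, the default is 30
seconds):

[core]
dagbag_import_timeout = 60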
We have been using 1.7 for over a year and never faced this issue.
The moment we switched to 1.8, I think we hit this issue.
The reason I say "I think" is because I am not sure if it is the same
issue. But whenever I restart, my pipeline proceeds.
Airflow 1.7: Having said that, in 1.7, I d
happens on our setup, on 1.8 as well.
we have kept this number at 10, which seems to work well for us.
On Fri, Mar 24, 2017 at 12:16 PM, Nicholas Hodgkinson <
nik.hodgkin...@collectivehealth.com> wrote:
> So I'm experiencing a problem that I can't figure out; namely my scheduler
> just stops s
Hi guys,
So I have airflow 1.8 running at my company now. Overall, the performance
has improved and scheduling has been faster.
The jobs are running and the pipelines do progress, but I am running into a
few issues. Please help if you have seen this before. Any help will be
appreciated.
1. Jobs g
Right now our airflow UI shows the time 06:51 UTC while the correct time
is 05:51 UTC
- Harish
On Thu, Mar 9, 2017 at 1:53 PM, Rob Goretsky
wrote:
> I realized after sending this that some of this behavior about the updated
> 'start_date' not taking effect is explained / addressed in the "Pro
r 30 dags, airflow runs fine after your
> increase of heartbeat?
> The default is 5 secs.
>
>
> Thanks.
> Jason
>
>
> On Tue, Mar 7, 2017 at 10:24 AM, harish singh
> wrote:
>
> > I had seen a similar behavior, a year ago, when we were are < 5 Dags.
> Even
I had seen a similar behavior, a year ago, when we were at < 5 DAGs. Even
then the cpu utilization was reaching 100%.
One way to deal with this is: you could play with the "heartbeat" numbers
(i.e. increase the heartbeat).
But then you are introducing more delay to start jobs that are ready to run
(ready t
future. So performance may still be an issue in the
future).
The numbers used to be 1 core, 8g when the performance (queued -> running)
was very slow.
Thanks,
Harish
On Wed, Dec 7, 2016 at 2:31 PM, harish singh
wrote:
> Any pointers on this?
> How to move the jobs from Queued to Runni
Any pointers on this?
How to move the jobs from Queued to Running faster?
Thanks,
Harish
On Tue, Dec 6, 2016 at 2:01 PM, harish singh
wrote:
> Hi Guys,
>
> Doing a month backfill for all the pipelines has brought up some issues,
> which we may not have noticed before.
>
> O
Hi Guys,
Doing a month backfill for all the pipelines has brought up some issues,
which we may not have noticed before.
One of the issues I am seeing is:
We use airflow pools.
From what I currently see in the UI, we have a pool named, say, "pool_1"
which has "Queued Slots" = 30
and Used Slots =
>> > > a record in underlying database after there is an explicit call to
>> > > airflow from that library (using Local Executor). So, I might be wrong,
>> > > but you won't find a record in database un
Hi all,
We have been running Airflow in production for over 8-9 months now.
I know there is a separate thread in place for Airflow 2.0.
But I was not sure if any of the prior versions has this fixed. If not, I
will add this to the other email thread for 2.0.
When I run airflow backfill with "
Hi everyone,
We have been using airflow for roughly 7 months. Pretty smooth :)
Thanks!
Today when I built my airflow container, I found a weird issue.
This is the install command for airflow.
ENV AIRFLOW_VERSION 1.7.0
RUN sudo pip install airflow==${AIRFLOW_VERSION}
Now the problem happens
This is how we have been doing it.

import logging
from airflow import settings
from airflow.models import Pool

pool_name = "pool_A"
num_slots = 10

# create the pool only if it does not already exist
session = settings.Session()
pool = (
    session.query(Pool)
    .filter(Pool.pool == pool_name)
    .first())
if not pool:
    logging.info("Creating pool %s", pool_name)
    session.add(Pool(pool=pool_name, slots=num_slots))
    session.commit()
session.close()
Hi Guys,
I am facing an issue that I think may be a serious one (or I may just be
doing something totally wrong).
So we use 'pool' in our pipeline.
We have a pool "cpu_pool" with 2 slots.
Now, currently, I see that both the slots are used.
And there are 18 tasks that are Queued for this pool.
We are
On Mon, Jun 27, 2016 at 8:25 AM, Lance Norskog
wrote:
> You can add retries to the task, including a timeout and a counter. So,
> 5 retries with an hour in between might be a strategy.
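In operator terms that strategy maps to roughly this; a sketch only, the DAG,
task, command and timeout below are made up:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="retry_demo",                      # hypothetical id
    start_date=datetime(2016, 6, 1),
    schedule_interval="@daily")

check = BashOperator(
    task_id="check_upstream_data",            # hypothetical task
    bash_command="echo checking",             # placeholder command
    retries=5,                                # 5 attempts, as suggested
    retry_delay=timedelta(hours=1),           # an hour between attempts
    execution_timeout=timedelta(minutes=30),  # per-attempt timeout (example)
    dag=dag)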
>
>
> On Sat, Jun 25, 2016 at 7:24 PM, harish singh
> wrote:
>
> > Hi guys,
Hi guys,
I am trying to build a pipeline/script to monitor our Data-processing
pipeline :)
Basically, I am trying to do these things:
1. Go back in time n hours and get the status of a TASK for the last n hours
(assuming hourly jobs).
I can use the airflow CLI command "task_state" to achieve thi
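For reference, the invocation I have in mind looks roughly like this (the dag
id, task id and date are made up):

airflow task_state my_hourly_dag my_task 2016-06-25T10:00:00

which prints the state (success, running, failed, ...) of that task instance.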
> > > send no more than 3 requests to this database at once". However, there
> > > are
> > > bugs in the scheduler and it is possible to have many active tasks
> > > overscheduled against a pool.
> > >
> > > You can create a pool in the Admin->Pools drop-dow
Hi,
We have been using airflow for about 3 months now.
One pain point I felt was: during backfill, if I have 2 tasks t1 and t2, with
t1 having depends_on_past=true,
t0 -> t1
t0 -> t2
I find that the task t2 with no past dependency keeps getting scheduled.
This causes the task
Hi guys,
Since we have "dag_conurrency" restriction, I tried to play with
dagrun_timeout.
So that after some interval, dag runs are marked failed and pipeline
progresses.
But this is not happening.
I have this dag (@hourly):
A -> B -> C -> D -> E
C: depends_on_past=true
My dagrun_timeout is
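For reference, this is roughly how dagrun_timeout is attached to the DAG; the
value below is only an example, since the real one is cut off above:

from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    dag_id="hourly_pipeline",               # hypothetical id
    start_date=datetime(2016, 5, 1),
    schedule_interval="@hourly",
    dagrun_timeout=timedelta(hours=1))      # example value only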
another can of worms, but I think I've seen discussion on here of
> that idea and I thought I'd put my vote in for that pattern.
>
> On Mon, Jun 13, 2016 at 10:35 AM harish singh
> wrote:
>
> > So I changed the scheduler heartbeat to 60 sec
> >
> >
eat of around 30 seconds
(or the duration of whatever you believe is your least time-consuming
task). This would just be a rough optimization to make sure we make
progress soon enough after the end of a task.
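Concretely, the knobs live in the scheduler section of airflow.cfg; the values
below are only examples (30 matching the suggestion above, the heartbeat
default being 5 seconds):

[scheduler]
scheduler_heartbeat_sec = 30
max_threads = 2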
Thanks,
Harish
On Mon, Jun 13, 2016 at 9:50 AM, harish singh
wrote:
> yup. it is the
hould look spiky.
>
> Max
>
> On Mon, Jun 13, 2016 at 8:26 AM, harish singh
> wrote:
>
> > Yup, I tried changing the scheduler heartbeat to 60 seconds.
> > Apart from not getting any update for 60 seconds, what are the side
> effects
> > of changing the t
o the scheduler check out
> SCHEDULER_HEARTBEAT_SEC and MAX_THREADS in the scheduler section of
> `airflow.cfg`
>
> Max
>
> On Sun, Jun 12, 2016 at 1:24 PM, harish singh
> wrote:
>
> > Hi guys,
> >
> > We are running airflow (for about 3 months now) inside a docker container
he tasks start_date, in the context of a
> backfill CLI command, get overridden by the backfill’s command start_date.
> This allows for a backfill on tasks that have depends_on_past=True to
> actually start, if it wasn’t the case, the backfill just wouldn’t start.
>
> On Sun, Jun 12,
These are the default args to my DAG.
I am trying to run a standard hourly job (basically, at the end of
this hour, process the last hour's data).
I noticed that my pipeline is 1 hour late.
For some reason, I am messing up my start_date, I guess.
What is the best practice for setting up start_date?
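For what it's worth, here is a minimal sketch of the pattern I have seen
recommended: a fixed, static start_date (not one computed from utcnow()),
keeping in mind that an @hourly run for hour H only starts once the H to H+1
interval has closed, which is why the pipeline looks one hour "late":

from datetime import datetime
from airflow import DAG

default_args = {
    "owner": "airflow",
    # fixed start date; avoid values derived from datetime.utcnow()
    "start_date": datetime(2016, 6, 1),
}

dag = DAG(
    dag_id="hourly_pipeline",        # hypothetical id
    default_args=default_args,
    schedule_interval="@hourly")
# the run with execution_date 10:00 is launched after 11:00,
# i.e. at the end of the 10:00-11:00 interval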
Hi guys,
We are running airflow (for about 3 months now) inside a docker container
on aws.
I just did a docker stats to check what's going on. The cpu consumption is
huge.
We have around 15 DAGs. Only one DAG is turned ON; the remaining are OFF.
The DAG runs with an HOURLY schedule.
Right now, air
very expensive? How
can we avoid this?
Thanks.
On Mon, May 16, 2016 at 2:38 AM, Bolke de Bruin wrote:
>
>
> > On 15 mei 2016, at 22:50, harish singh wrote:
> >
> > Our DAG (hourly) has 10 tasks (all of them Bash Operators - issuing curl
> > commands).
> > We ru
Our DAG (hourly) has 10 tasks (all of them Bash Operators - issuing curl
commands).
We run airflow on docker.
When we do a backfill for, say, the last 10 days, we see that airflow
consistently hits the memory limit (4gb) and the container dies (OOM
Killed).
We increased the memory to 8gb. I still see
Bruin wrote:
>
> > On 13 May 2016, at 22:51, harish singh
> > wrote the following:
> >
> > Bolke, its 1.7.0
> >
> >
> > On Fri, May 13, 2016 at 1:35 PM, Bolke de Bruin
> wrote:
> >
> >>
> >>> On 13 May 2016, at 22:
Bolke, its 1.7.0
On Fri, May 13, 2016 at 1:35 PM, Bolke de Bruin wrote:
>
> > On 13 May 2016, at 22:19, harish singh
> > wrote the following:
> >
> > Hi guys,
> >
> > I am having an issue with making 'depends_on_past=true' work
> >
Hi guys,
I am having an issue with making 'depends_on_past=true' work
This is my pipeline:
a -> b -> c -> d -> e
a -> x -> e
a -> y -> e
I have default args for all Tasks:
from datetime import datetime, timedelta
# hour-aligned timestamp, one hour in the past
scheduling_start_date = (datetime.utcnow() - timedelta(hours=1)).replace(
    minute=0, second=0, microsecond=0)
defau
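A minimal sketch of how depends_on_past is usually attached, either globally
through default_args or per task; the task ids follow the pipeline above,
everything else is illustrative:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="pipeline_example",               # hypothetical id
    start_date=datetime(2016, 5, 1),         # fixed start date
    schedule_interval="@hourly")

a = BashOperator(task_id="a", bash_command="echo a", dag=dag)
b = BashOperator(task_id="b", bash_command="echo b",
                 depends_on_past=True,       # only b waits on its previous run
                 dag=dag)
e = BashOperator(task_id="e", bash_command="echo e", dag=dag)

a.set_downstream(b)
b.set_downstream(e)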
any further thoughts on this?
On Sat, Apr 30, 2016 at 5:08 PM, harish singh
wrote:
> Is it possible that the sql you're running to get customer ids is not the
> same every time? That's what I (loosely) meant by non-deterministic.
> [response]
> The sql is the same. But it
mation" -> "store customer i information"). That
> way your DAG and tasks remain stable. Perhaps someone else on the list
> could share an effective pattern along those lines.
>
> Now, if that isn't your situation, something strange is happening that's
&g
*"However a DAG with such a complicated name isn't referenced in the
examplecode (just "Pipeline" + i). My guess is that the DAG id is being
generatedin a non-deterministic or time-based way, and therefore the run
commandcan't find it once the generation criteria change. But hard to say
withoutmore
Oh, that's my bad.
I copied the logs directly. I was trying to simplify the names.
Consider this:
airflow.utils.AirflowException: DAG
[Pipeline_1]
could not be found in /usr/local/airflow/dags/pipeline.py
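For reference, the pattern that avoids this class of error is to build the
dag_id from stable inputs and register every generated DAG at module level, so
the id the scheduler computes is always one the file actually defines; a rough
sketch, all names hypothetical:

from datetime import datetime
from airflow import DAG

customer_ids = ["1", "2", "3"]               # in practice, from a stable source

for customer_id in customer_ids:
    dag_id = "Pipeline_%s" % customer_id     # deterministic, not time-based
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2016, 4, 1),
        schedule_interval="@hourly")
    # module-level registration so the DagBag can find the DAG by its id
    globals()[dag_id] = dag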
On Fri, Apr 29, 2016 at 12:19 PM, Jeremiah Lowin wrote:
> That error message usually means t