Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-14 Thread Ruslan Dautkhanov
`airflow kerberos` is broken in 1.8-rc5
https://issues.apache.org/jira/browse/AIRFLOW-987
Hopefully a fix can be part of the 1.8 release.



-- 
Ruslan Dautkhanov

On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand  wrote:

> FYI,
> I've just hit a major bug in the release candidate related to "clear task"
> behavior.
>
> I've been running airflow in both stage and prod since yesterday on rc5 and
> have reproduced this in both environments. I will file a JIRA for this
> tonight, but wanted to send a note over email as well.
>
> In my example, I have a 2 task DAG. For a given DAG run that has completed
> successfully, if I
> 1) clear task2 (the leaf task in this case), the previously-successful DAG Run
> goes back to Running, requeues, and executes the task successfully. The DAG
> Run then returns from Running to Success.
> 2) clear task1 (the root task in this case), the previously-successful DAG Run
> goes back to Running, but DOES NOT requeue or execute the task at all. The DAG
> Run then returns from Running to Success even though it never ran the task.
>
> 1) is expected and previous behavior. 2) is a regression.
>
> The only workaround is to use the CLI to run the cleared task. Here are
> some images :
> *After Clearing the Tasks*
> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%202017-03-14%2014.09.34.png?dl=0
>
> *After DAG Runs return to Success*
> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%202017-03-14%2014.09.49.png?dl=0
>
> This is a major regression because it will force everyone to use the CLI
> for things that they would normally use the UI for.
>
> -s
>
>
> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang  wrote:
>
> > +1 (non-binding)!
> >
> > On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
> > wrote:
> >
> > > +1 (binding)
> > >
> > >
> > > On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
> > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger rules
> > > > > where the parent nodes were joining with a SKIP. But I kind of should have
> > > > > known this was coming. Apart from that I had a successful run last night.
> > > > >
> > > > >
> > > > > On Tue, Mar 14, 2017 at 1:37 AM siddharth anand  >
> > > > wrote:
> > > > >
> > > > > I'm going to deploy this to staging now. Fab work Bolke!
> > > > > -s
> > > > >
> > > > > On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
> dan.davy...@airbnb.com
> > .
> > > > > invalid
> > > > > > wrote:
> > > > >
> > > > > > I'll test this on staging as soon as I get a chance (the testing is
> > > > > > non-blocking on the rc5). Bolke very much in particular :).
> > > > > >
> > > > > > On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
> > jlo...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (binding) extremely impressed by the work and diligence all
> > > > > > > contributors have put in to getting these blockers fixed, Bolke in
> > > > > > > particular.
> > > > > > >
> > > > > > > On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
> > art...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Thanks again for steering us through Bolke.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Arthur
> > > > > > > >
> > > > > > > > On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
> > > bdbr...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Dear All,
> > > > > > > > >
> > > > > > > > > Finally, I have been able to make the FIFTH RELEASE CANDIDATE
> > > > > > > > > of Airflow
> > > > > >

`airflow webserver -D` runs in foreground

2017-03-17 Thread Ruslan Dautkhanov
$ pip freeze
airflow==*1.8.0rc5*+apache.incubating

airflow webserver doesn't want to daemonize


$ airflow webserver --daemon
[2017-03-17 00:06:37,553] {__init__.py:57} INFO - Using executor
LocalExecutor

.. skip ..
Running the Gunicorn Server with:
Workers: 4 sync
Host: 0.0.0.0:18111
Timeout: 120
Logfiles: - -
=
[2017-03-17 00:06:39,744] {__init__.py:57} INFO - Using executor
LocalExecutor


It keeps running in the foreground.
I am probably missing something simple?

ps. Good project name - airflow is the 1st one in `pip freeze` output :)


Thanks,
Ruslan Dautkhanov


Re: `airflow webserver -D` runs in foreground

2017-03-17 Thread Ruslan Dautkhanov
Thanks Bolke.

I didn't find a jira regarding this issue so filed a new one
https://issues.apache.org/jira/browse/AIRFLOW-1004



-- 
Ruslan Dautkhanov

On Fri, Mar 17, 2017 at 12:36 PM, Bolke de Bruin  wrote:

> This is a (known) bug, since the introduction of the rolling restarts.
>
> Bolke.
>
> > On 17 Mar 2017, at 09:48, Ruslan Dautkhanov 
> wrote:
> >
> > $ pip freeze
> > airflow==*1.8.0rc5*+apache.incubating
> >
> > airflow webserver doesn't want to daemonize
> >
> >
> > $ airflow webserver --daemon
> > [2017-03-17 00:06:37,553] {__init__.py:57} INFO - Using executor
> > LocalExecutor
> >
> > .. skip ..
> > Running the Gunicorn Server with:
> > Workers: 4 sync
> > Host: 0.0.0.0:18111
> > Timeout: 120
> > Logfiles: - -
> > =
> > [2017-03-17 00:06:39,744] {__init__.py:57} INFO - Using executor
> > LocalExecutor
> >
> >
> > It keeps running in the foreground.
> > I am probably missing something simple?
> >
> > ps. Good project name - airflow is the 1st one in `pip freeze` output :)
> >
> >
> > Thanks,
> > Ruslan Dautkhanov
>
>


SQLOperator?

2017-03-17 Thread Ruslan Dautkhanov
I can't find any references to SQLOperator, either in the source code or in
the API Reference.

Although it is mentioned on the Concepts page:

https://github.com/apache/incubator-airflow/blob/master/docs/concepts.rst#operators



   - SqlOperator - executes a SQL command

Sorry for the basic questions - I just started using Airflow this week.
Did it get replaced with something else? If so, what is it?



Thanks,
Ruslan Dautkhanov


Re: SQLOperator?

2017-03-17 Thread Ruslan Dautkhanov
Thanks Alex.

I did notice the MySQL one.

We are working with Oracle.

I guess I will have to call OracleHook from PythonOperator?
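
Something like this minimal sketch, I'd assume (the 'oracle_default'
connection id and the query are placeholders):

from airflow.hooks.oracle_hook import OracleHook
from airflow.operators.python_operator import PythonOperator

def query_oracle(**context):
    # the hook reads host/credentials from the Airflow connection store
    hook = OracleHook(oracle_conn_id='oracle_default')
    # get_records() runs the SQL and returns rows as a list of tuples
    return hook.get_records("SELECT sysdate FROM dual")

oracle_task = PythonOperator(
    task_id='query_oracle',
    python_callable=query_oracle,
    provide_context=True,
    dag=dag,  # assumes a DAG object named `dag` exists
)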



On Fri, Mar 17, 2017 at 9:25 PM Alex Guziel 
wrote:

> I'm not sure if that one went away but there are different SQL operators,
> like MySqlOperator, MsSqlOperator, etc. that I see.
>
> Best,
> Alex
>
> On Fri, Mar 17, 2017 at 7:56 PM, Ruslan Dautkhanov 
> wrote:
>
> > I can't find references to SQLOperator neither in the source code nor in
> > the API Reference.
> >
> > Although it is mentioned in Concepts page :
> >
> > https://github.com/apache/incubator-airflow/blob/master/
> > docs/concepts.rst#operators
> >
> >
> >
> >- SqlOperator - executes a SQL command
> >
> > Sorry for basic questions - just started using Airflow this week.
> > Did it got replaced with something else? If so what that is?
> >
> >
> >
> > Thanks,
> > Ruslan Dautkhanov
> >
>


Re: SQLOperator?

2017-03-18 Thread Ruslan Dautkhanov
I just noticed that there is already an OracleOperator

https://github.com/apache/incubator-airflow/blob/master/airflow/operators/oracle_operator.py
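
Usage looks straightforward - a minimal sketch (the connection id and SQL
below are placeholders, not taken from the operator's docs):

from airflow.operators.oracle_operator import OracleOperator

refresh_staging = OracleOperator(
    task_id='refresh_staging',
    oracle_conn_id='oracle_default',
    sql='TRUNCATE TABLE staging_events',
    dag=dag,  # assumes a DAG object named `dag` exists
)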

"Concepts" page has to be updated probably to reflect that there is no
SQLOpartor.

Yep, I have cx_Oracle installed.

Thanks Alex and Boris.



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 12:20 PM, Boris Tyukin 
wrote:

> >>I guess I will have to call OracleHook from PythonOperator?
>
> that's what I do! works fine. do not forget to install the cx_Oracle library
> https://pypi.python.org/pypi/cx_Oracle/5.2.1
>
> you can also create your own operator using one of existing ones as an
> example
>
> On Fri, Mar 17, 2017 at 11:49 PM, Ruslan Dautkhanov 
> wrote:
>
> > Thanks Alex
> >
> > I did notice MySQL one
> >
> > We are working with Oracle
> >
> > I guess I will have to call OracleHook from PythonOperator?
> >
> >
> >
> > On Fri, Mar 17, 2017 at 9:25 PM Alex Guziel  > invalid>
> > wrote:
> >
> > > I'm not sure if that one went away but there are different SQL
> operators,
> > > like MySqlOperator, MsSqlOperator, etc. that I see.
> > >
> > > Best,
> > > Alex
> > >
> > > On Fri, Mar 17, 2017 at 7:56 PM, Ruslan Dautkhanov <
> dautkha...@gmail.com
> > >
> > > wrote:
> > >
> > > > I can't find references to SQLOperator neither in the source code nor
> > in
> > > > the API Reference.
> > > >
> > > > Although it is mentioned in Concepts page :
> > > >
> > > > https://github.com/apache/incubator-airflow/blob/master/
> > > > docs/concepts.rst#operators
> > > >
> > > >
> > > >
> > > >- SqlOperator - executes a SQL command
> > > >
> > > > Sorry for basic questions - just started using Airflow this week.
> > > > Did it got replaced with something else? If so what that is?
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > Ruslan Dautkhanov
> > > >
> > >
> >
>


Re: SQLOperator?

2017-03-18 Thread Ruslan Dautkhanov
Does this documentation change look good?

https://github.com/apache/incubator-airflow/pull/2168/files




-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 2:59 PM, Ruslan Dautkhanov 
wrote:

> I just noticed that there is already an OracleOperator
>
> https://github.com/apache/incubator-airflow/blob/master/
> airflow/operators/oracle_operator.py
>
> "Concepts" page has to be updated probably to reflect that there is no
> SQLOpartor.
>
> Yep, I have cx_Oracle installed.
>
> Thanks Alex and Boris.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Sat, Mar 18, 2017 at 12:20 PM, Boris Tyukin 
> wrote:
>
>> >>I guess I will have to call OracleHook from PythonOperator?
>>
>> that's what I do! works fine. do not forget to install the cx_Oracle library
>> https://pypi.python.org/pypi/cx_Oracle/5.2.1
>>
>> you can also create your own operator using one of existing ones as an
>> example
>>
>> On Fri, Mar 17, 2017 at 11:49 PM, Ruslan Dautkhanov > >
>> wrote:
>>
>> > Thanks Alex
>> >
>> > I did notice MySQL one
>> >
>> > We are working with Oracle
>> >
>> > I guess I will have to call OracleHook from PythonOperator?
>> >
>> >
>> >
>> > On Fri, Mar 17, 2017 at 9:25 PM Alex Guziel > > invalid>
>> > wrote:
>> >
>> > > I'm not sure if that one went away but there are different SQL
>> operators,
>> > > like MySqlOperator, MsSqlOperator, etc. that I see.
>> > >
>> > > Best,
>> > > Alex
>> > >
>> > > On Fri, Mar 17, 2017 at 7:56 PM, Ruslan Dautkhanov <
>> dautkha...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > I can't find references to SQLOperator neither in the source code
>> nor
>> > in
>> > > > the API Reference.
>> > > >
>> > > > Although it is mentioned in Concepts page :
>> > > >
>> > > > https://github.com/apache/incubator-airflow/blob/master/
>> > > > docs/concepts.rst#operators
>> > > >
>> > > >
>> > > >
>> > > >- SqlOperator - executes a SQL command
>> > > >
>> > > > Sorry for basic questions - just started using Airflow this week.
>> > > > Did it got replaced with something else? If so what that is?
>> > > >
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Ruslan Dautkhanov
>> > > >
>> > >
>> >
>>
>
>


Re: SparkOperator - tips and feedback?

2017-03-18 Thread Ruslan Dautkhanov
+1 Great idea.

my two cents - it would be nice (as an option) if SparkOperator were able to
keep the context open between different calls,
as it takes 30+ seconds to create a new context (on our cluster). Not sure
how well that fits the Airflow architecture.
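
For illustration, the spark-submit pattern the new operator would formalize
(BashOperator works today, as you said) - a minimal sketch where the master,
paths and job script are all placeholders:

from airflow.operators.bash_operator import BashOperator

submit_etl = BashOperator(
    task_id='spark_submit_etl',
    bash_command=(
        'spark-submit '
        '--master yarn '
        '--deploy-mode cluster '
        '--num-executors 4 '
        '/opt/jobs/etl_job.py {{ ds }}'  # pass the execution date to the job
    ),
    dag=dag,  # assumes a DAG object named `dag` exists
)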



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 3:45 PM, Russell Jurney 
wrote:

> What do people think about creating a SparkOperator that uses spark-submit
> to submit jobs? Would work for Scala/Java Spark and PySpark. The patterns
> outlined in my presentation on Airflow and PySpark
> <http://bit.ly/airflow_pyspark> would fit well inside an Operator, I
> think.
> BashOperator works, but why not tailor something to spark-submit?
>
> I'm open to doing the work, but I wanted to see what people thought about it
> and get feedback about things they would like to see in SparkOperator and
> get any pointers people had to doing the implementation.
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>


Re: Reminder : LatestOnlyOperator

2017-03-18 Thread Ruslan Dautkhanov
Is there a way to combine scheduling-behavior operators (like this
LatestOnlyOperator)
with a functional operator (like OracleOperator)? I was thinking multiple
inheritance would do, like

> class OracleLatestOnlyOperator(OracleOperator, LatestOnlyOperator):
> ...

I might be overthinking this and there could be a simpler way?
Sorry, I am still learning Airflow concepts...
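
Or would simply chaining them work? A minimal sketch of what I mean (the
connection id and SQL are placeholders):

from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.oracle_operator import OracleOperator

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
daily_snapshot = OracleOperator(
    task_id='daily_snapshot',
    oracle_conn_id='oracle_default',
    sql='INSERT INTO snapshots SELECT sysdate, t.* FROM source_table t',
    dag=dag,
)
# everything downstream of latest_only is skipped for all but the
# most recent scheduled DAG run
latest_only >> daily_snapshot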

Thanks.



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 2:15 PM, Boris Tyukin  wrote:

> Thanks George for that feature!
>
> sure, just created a jira on this
> https://issues.apache.org/jira/browse/AIRFLOW-1008
>
>
> On Sat, Mar 18, 2017 at 12:05 PM, siddharth anand 
> wrote:
>
> > Thx Boris . Credit goes to George (gwax) for the implementation of the
> > LatestOnlyOperator.
> >
> > Boris,
> > Can you describe what you mean in a Jira?
> > -s
> >
> > On Fri, Mar 17, 2017 at 6:02 PM, Boris Tyukin 
> > wrote:
> >
> > > this is nice indeed along with the new catchup option
> > > https://airflow.incubator.apache.org/scheduler.html#
> backfill-and-catchup
> > >
> > > Thanks Sid and Ben for adding these new options!
> > >
> > > for a complete picture, it would be nice to force only one dag run at
> the
> > > time.
> > >
> > > On Fri, Mar 17, 2017 at 7:33 PM, siddharth anand 
> > > wrote:
> > >
> > > > With the Apache Airflow 1.8 release imminent, you may want to try out
> > the
> > > >
> > > > *LatestOnlyOperator.*
> > > >
> > > > If you want your DAG to only run on the most recent scheduled slot,
> > > > regardless of backlog, this operator will skip running downstream
> tasks
> > > for
> > > > all DAG Runs prior to the current time slot.
> > > >
> > > > For example, I might have a DAG that takes a DB snapshot once a day.
> It
> > > > might be that I paused that DAG for 2 weeks or that I had set the
> start
> > > > date to a fixed data 2 weeks in the past. When I enable my DAG, I
> don't
> > > > want it to run 14 days' worth of snapshots for the current state of
> the
> > > DB
> > > > -- that's unnecessary work.
> > > >
> > > > The LatestOnlyOperator avoids that work.
> > > >
> > > > https://github.com/apache/incubator-airflow/commit/edf033be65b575f44aa221d5d0ec9ecb6b32c67a
> > > >
> > > > With it, you can simply use
> > > > latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
> > > >
> > > > instead of
> > > > def skip_to_current_job(ds, **kwargs):
> > > >     now = datetime.now()
> > > >     left_window = kwargs['dag'].following_schedule(kwargs['execution_date'])
> > > >     right_window = kwargs['dag'].following_schedule(left_window)
> > > >     logging.info(('Left Window {}, Now {}, Right Window {}').format(left_window, now, right_window))
> > > >     if not now <= right_window:
> > > >         logging.info('Not latest execution, skipping downstream.')
> > > >         return False
> > > >     return True
> > > >
> > > > short_circuit = ShortCircuitOperator(
> > > >   task_id = 'short_circuit_if_not_current_job',
> > > >   provide_context = True,
> > > >   python_callable = skip_to_current_job,
> > > >   dag = dag
> > > > )
> > > >
> > > > -s
> > > >
> > >
> >
>


Re: SparkOperator - tips and feedback?

2017-03-18 Thread Ruslan Dautkhanov
Thanks Russel.

Yep, I meant SparkContext (or SparkSession in Spark 2).
It's not only about the startup time of a Spark session (a 30-second delay for
each task is still a lot).

Another benefit: it'll be super useful if we decide to cache() a dataframe.
That could be a huge gain for other tasks that need that (cached) dataframe.

Without an option to share a Spark session, each DAG task has to
(1) restart the Spark context,
and (2) re-cache the dataframes needed in the workflow.
That's a major slowdown for a Spark job.

>> but an InteractiveSparkContext or SparkConsoleContext might be able to
do this?

I couldn't find InteractiveSparkContext or SparkConsoleContext in Airflow
or in Spark... please elaborate.

Thanks again.



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 7:43 PM, Russell Jurney 
wrote:

> Ruslan, thanks for your feedback.
>
> You mean the spark-submit context? Or like the SparkContext and
> SparkSession? I don't think we could keep that alive, because it wouldn't
> work out with multiple calls to spark-submit. I do feel your pain, though.
> Maybe someone else can see how this might be done?
>
> If SparkContext was able to open the spark/pyspark console, then multiple
> job submissions would be possible. I didn't have this in mind but an
> InteractiveSparkContext or SparkConsoleContext might be able to do this?
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
> On Sat, Mar 18, 2017 at 3:02 PM, Ruslan Dautkhanov 
> wrote:
>
> > +1 Great idea.
> >
> > my two cents - it would be nice (as an option) if SparkOperator would be
> > able to keep context open between different calls,
> > as it takes 30+ seconds to create a new context (on our cluster). Not
> sure
> > how well it fits Airflow architecture.
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Sat, Mar 18, 2017 at 3:45 PM, Russell Jurney <
> russell.jur...@gmail.com>
> > wrote:
> >
> > > What do people think about creating a SparkOperator that uses
> > spark-submit
> > > to submit jobs? Would work for Scala/Java Spark and PySpark. The
> patterns
> > > outlined in my presentation on Airflow and PySpark
> > > <http://bit.ly/airflow_pyspark> would fit well inside an Operator, I
> > > think.
> > > BashOperator works, but why not tailor something to spark-submit?
> > >
> > > I'm open to doing the work, but I wanted to see what people thought about it
> > > and get feedback about things they would like to see in SparkOperator and
> > > get any pointers people had to doing the implementation.
> > >
> > > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > > <http://facebook.com/jurney> datasyndrome.com
> > >
> >
>


Re: SparkOperator - tips and feedback?

2017-03-18 Thread Ruslan Dautkhanov
Thanks Bolke. That's awesome.

1)
So each task would create its own Spark session?
Is there a way to have Spark session sharing, like discussed in this
email chain?

2)
Looks like SparkSqlHook calls the `spark-sql` shell with all those parameters?

https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/spark_sql_hook.py#L88

This probably will not work in Cloudera's distribution of Spark.
I think they stopped shipping `spark-sql` as of CDH 5.4.
`spark-sql` is not included because CDH Spark doesn't have the Thrift service,
or because of some other reason.

Thank you.



-- 
Ruslan Dautkhanov

On Sat, Mar 18, 2017 at 8:24 PM, Bolke de Bruin  wrote:

> A spark operator exists as of 1.8.0 (which will be released tomorrow), you
> might want to take a look at that. I know that an update is coming to that
> operator that improves communication with Yarn.
>
> Bolke
>
> > On 18 Mar 2017, at 18:43, Russell Jurney 
> wrote:
> >
> > Ruslan, thanks for your feedback.
> >
> > You mean the spark-submit context? Or like the SparkContext and
> > SparkSession? I don't think we could keep that alive, because it wouldn't
> > work out with multiple calls to spark-submit. I do feel your pain,
> though.
> > Maybe someone else can see how this might be done?
> >
> > If SparkContext was able to open the spark/pyspark console, then multiple
> > job submissions would be possible. I didn't have this in mind but an
> > InteractiveSparkContext or SparkConsoleContext might be able to do this?
> >
> > Russell Jurney @rjurney <http://twitter.com/rjurney>
> > russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> > <http://facebook.com/jurney> datasyndrome.com
> >
> > On Sat, Mar 18, 2017 at 3:02 PM, Ruslan Dautkhanov  >
> > wrote:
> >
> >> +1 Great idea.
> >>
> >> my two cents - it would be nice (as an option) if SparkOperator would be
> >> able to keep context open between different calls,
> >> as it takes 30+ seconds to create a new context (on our cluster). Not
> sure
> >> how well it fits Airflow architecture.
> >>
> >>
> >>
> >> --
> >> Ruslan Dautkhanov
> >>
> >> On Sat, Mar 18, 2017 at 3:45 PM, Russell Jurney <
> russell.jur...@gmail.com>
> >> wrote:
> >>
> >>> What do people think about creating a SparkOperator that uses
> >> spark-submit
> >>> to submit jobs? Would work for Scala/Java Spark and PySpark. The
> patterns
> >>> outlined in my presentation on Airflow and PySpark
> >>> <http://bit.ly/airflow_pyspark> would fit well inside an Operator, I
> >>> think.
> >>> BashOperator works, but why not tailor something to spark-submit?
> >>>
> >>> I'm open to doing the work, but I wanted to see what people thought about it
> >>> and get feedback about things they would like to see in SparkOperator and
> >>> get any pointers people had to doing the implementation.
> >>>
> >>> Russell Jurney @rjurney <http://twitter.com/rjurney>
> >>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> >>> <http://facebook.com/jurney> datasyndrome.com
> >>>
> >>
>
>


Re: Reminder : LatestOnlyOperator

2017-03-20 Thread Ruslan Dautkhanov
Thanks Boris. It does make sense.
Although how is it different from the depends_on_past task-level parameter?
In both cases, a task will be skipped if another TI of this task
is still running (from a previous dagrun), right?


Thanks,
Ruslan


On Sat, Mar 18, 2017 at 7:11 PM, Boris Tyukin  wrote:

> you would just chain them - there is an example that came with airflow 1.8
> https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_latest_only.py
>
> so in your case, instead of dummy operator, you would use your Oracle
> operator.
>
> Does it make sense?
>
>
> On Sat, Mar 18, 2017 at 7:12 PM, Ruslan Dautkhanov 
> wrote:
>
> > Is there is a way to combine scheduling behavior operators  (like this
> > LatestOnlyOperator)
> > with a functional operator (like Oracle_Operator)? I was thinking
> multiple
> > inheritance would do,like
> >
> > > class Oracle_LatestOnly_Operator (Oracle_Operator, LatestOnlyOperator):
> > > ...
> >
> > I might be overthinking this and there could be a simpler way?
> > Sorry, I am still learning Airflow concepts...
> >
> > Thanks.
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Sat, Mar 18, 2017 at 2:15 PM, Boris Tyukin 
> > wrote:
> >
> > > Thanks George for that feature!
> > >
> > > sure, just created a jira on this
> > > https://issues.apache.org/jira/browse/AIRFLOW-1008
> > >
> > >
> > > On Sat, Mar 18, 2017 at 12:05 PM, siddharth anand 
> > > wrote:
> > >
> > > > Thx Boris . Credit goes to George (gwax) for the implementation of
> the
> > > > LatestOnlyOperator.
> > > >
> > > > Boris,
> > > > Can you describe what you mean in a Jira?
> > > > -s
> > > >
> > > > On Fri, Mar 17, 2017 at 6:02 PM, Boris Tyukin  >
> > > > wrote:
> > > >
> > > > > this is nice indeed along with the new catchup option
> > > > > https://airflow.incubator.apache.org/scheduler.html#
> > > backfill-and-catchup
> > > > >
> > > > > Thanks Sid and Ben for adding these new options!
> > > > >
> > > > > for a complete picture, it would be nice to force only one dag run
> at
> > > the
> > > > > time.
> > > > >
> > > > > On Fri, Mar 17, 2017 at 7:33 PM, siddharth anand <
> san...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > With the Apache Airflow 1.8 release imminent, you may want to try
> > out
> > > > the
> > > > > >
> > > > > > *LatestOnlyOperator.*
> > > > > >
> > > > > > If you want your DAG to only run on the most recent scheduled
> slot,
> > > > > > regardless of backlog, this operator will skip running downstream
> > > tasks
> > > > > for
> > > > > > all DAG Runs prior to the current time slot.
> > > > > >
> > > > > > For example, I might have a DAG that takes a DB snapshot once a
> > day.
> > > It
> > > > > > might be that I paused that DAG for 2 weeks or that I had set the
> > > start
> > > > > > date to a fixed data 2 weeks in the past. When I enable my DAG, I
> > > don't
> > > > > > want it to run 14 days' worth of snapshots for the current state
> of
> > > the
> > > > > DB
> > > > > > -- that's unnecessary work.
> > > > > >
> > > > > > The LatestOnlyOperator avoids that work.
> > > > > >
> > > > > > https://github.com/apache/incubator-airflow/commit/
> > > > > > edf033be65b575f44aa221d5d0ec9ecb6b32c67a
> > > > > >
> > > > > > With it, you can simply use
> > > > > > latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
> > > > > >
> > > > > > instead of
> > > > > > def skip_to_current_job(ds, **kwargs):
> > > > > >     now = datetime.now()
> > > > > >     left_window = kwargs['dag'].following_schedule(kwargs['execution_date'])
> > > > > >     right_window = kwargs['dag'].following_schedule(left_window)
> > > > > >     logging.info(('Left Window {}, Now {}, Right Window {}').format(left_window, now, right_window))
> > > > > >     if not now <= right_window:
> > > > > >         logging.info('Not latest execution, skipping downstream.')
> > > > > >         return False
> > > > > >     return True
> > > > > >
> > > > > > short_circuit = ShortCircuitOperator(
> > > > > >   task_id = 'short_circuit_if_not_current_job',
> > > > > >   provide_context = True,
> > > > > >   python_callable = skip_to_current_job,
> > > > > >   dag = dag
> > > > > > )
> > > > > >
> > > > > > -s
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: 1.8.1 release

2017-03-21 Thread Ruslan Dautkhanov
Some of the issues I ran into while testing 1.8rc5 :

https://issues.apache.org/jira/browse/AIRFLOW-1015
> https://issues.apache.org/jira/browse/AIRFLOW-1013
> https://issues.apache.org/jira/browse/AIRFLOW-1004
> https://issues.apache.org/jira/browse/AIRFLOW-1003
> https://issues.apache.org/jira/browse/AIRFLOW-1001
> https://issues.apache.org/jira/browse/AIRFLOW-1015


It would be great to have at least some of them fixed in 1.8.1.

Thank you.




-- 
Ruslan Dautkhanov

On Tue, Mar 21, 2017 at 3:02 PM, Dan Davydov  wrote:

> Here is my list for targeted 1.8.1 fixes:
> https://issues.apache.org/jira/browse/AIRFLOW-982
> https://issues.apache.org/jira/browse/AIRFLOW-983
> https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the slow
> startup time from this new logic of orphaned/reset task)
> https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will hopefully
> have a fix out for soon just finishing up tests)
>
> We are also hitting a new issue with subdags with rc5 that we weren't
> hitting with rc4 where subdags will occasionally just hang (had to roll
> back from rc5 to rc4), I'll try to spin up a JIRA for it soon which should
> be on the list too.
>
>
> On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini 
> wrote:
>
> > Agreed. I'm looking for a list of checksums/JIRAs that we want in the
> > bugfix release.
> >
> > On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin 
> > wrote:
> >
> > >
> > >
> > > > On 21 Mar 2017, at 12:51, Bolke de Bruin  wrote:
> > > >
> > > > My suggestion, as we are using semantic versioning is:
> > > >
> > > > 1) no new features in the 1.8 branch
> > > > 2) only bug fixes in the 1.8 branch
> > > > 3) new features to land in 1.9
> > > >
> > > > This allows companies to
> > >
> > > Have a "known" version and can move to the new branch when they want to
> > > get new features. Obviously we only support N-1, so when 1.10 comes out
> > we
> > > stop supporting 1.8.X.
> > >
> > > >
> > > > Sent from my iPhone
> > > >
> > > >> On 21 Mar 2017, at 11:22, Chris Riccomini 
> > > wrote:
> > > >>
> > > >> Hey all,
> > > >>
> > > >> I suggest that we start a 1.8.1 Airflow release now. The goal would
> > be:
> > > >>
> > > >> 1) get a second release under our belt
> > > >> 2) patch known issues with the 1.8.0 release
> > > >>
> > > >> I'm happy to run it, but I saw Maxime mentioning that Airbnb might
> > want
> > > to.
> > > >> @Max et al, can you comment?
> > > >>
> > > >> Also, can folks supply JIRAs for stuff that think needs to be in the
> > > 1.8.1
> > > >> bugfix release?
> > > >>
> > > >> Cheers,
> > > >> Chris
> > >
> >
>


Re: Reminder : LatestOnlyOperator

2017-03-21 Thread Ruslan Dautkhanov
Thank you for the detailed explanation Boris.


Best regards,

Ruslan Dautkhanov

On Mon, Mar 20, 2017 at 12:12 PM, Boris Tyukin 
wrote:

> depends_on_past is looking at previous task instance which sounds the same
> as "latestonly" but the difference becomes apparent if you look at this
> example.
>
> Let's say you have a dag, scheduled to run every day and it has been
> failing for the past 3 days. The whole purpose of that dag is to populate
> snapshot table or do a daily backup.  If you use depends on past, you would
> have to rerun all missed runs or mark them as successful eventually doing
> useless work (3 daily snapshots or backups for the same data).
>
> LatestOnly allows you to bypass missed runs and just do it once for most
> recent instance.
>
> Another difference, depends on past is tricky if you use BranchOperator
> because some branches may not run one day and run another - it will really
> mess up your logic.
>
> On Mon, Mar 20, 2017 at 12:45 PM, Ruslan Dautkhanov 
> wrote:
>
> > Thanks Boris. It does make sense.
> > Although how it's different from depends_on_past task-level parameter?
> > In both cases, a task will be skipped if there is another TI of this task
> > is still running (from a previous dagrun), right?
> >
> >
> > Thanks,
> > Ruslan
> >
> >
> > On Sat, Mar 18, 2017 at 7:11 PM, Boris Tyukin 
> > wrote:
> >
> > > you would just chain them - there is an example that came with airflow
> > 1.8
> > > https://github.com/apache/incubator-airflow/blob/master/
> > > airflow/example_dags/example_latest_only.py
> > >
> > > so in your case, instead of dummy operator, you would use your Oracle
> > > operator.
> > >
> > > Does it make sense?
> > >
> > >
> > > On Sat, Mar 18, 2017 at 7:12 PM, Ruslan Dautkhanov <
> dautkha...@gmail.com
> > >
> > > wrote:
> > >
> > > > Is there is a way to combine scheduling behavior operators  (like
> this
> > > > LatestOnlyOperator)
> > > > with a functional operator (like Oracle_Operator)? I was thinking
> > > multiple
> > > > inheritance would do,like
> > > >
> > > > > class Oracle_LatestOnly_Operator (Oracle_Operator,
> > LatestOnlyOperator):
> > > > > ...
> > > >
> > > > I might be overthinking this and there could be a simpler way?
> > > > Sorry, I am still learning Airflow concepts...
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > > > --
> > > > Ruslan Dautkhanov
> > > >
> > > > On Sat, Mar 18, 2017 at 2:15 PM, Boris Tyukin  >
> > > > wrote:
> > > >
> > > > > Thanks George for that feature!
> > > > >
> > > > > sure, just created a jira on this
> > > > > https://issues.apache.org/jira/browse/AIRFLOW-1008
> > > > >
> > > > >
> > > > > On Sat, Mar 18, 2017 at 12:05 PM, siddharth anand <
> san...@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > Thx Boris . Credit goes to George (gwax) for the implementation
> of
> > > the
> > > > > > LatestOnlyOperator.
> > > > > >
> > > > > > Boris,
> > > > > > Can you describe what you mean in a Jira?
> > > > > > -s
> > > > > >
> > > > > > On Fri, Mar 17, 2017 at 6:02 PM, Boris Tyukin <
> > bo...@boristyukin.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > this is nice indeed along with the new catchup option
> > > > > > > https://airflow.incubator.apache.org/scheduler.html#
> > > > > backfill-and-catchup
> > > > > > >
> > > > > > > Thanks Sid and Ben for adding these new options!
> > > > > > >
> > > > > > > for a complete picture, it would be nice to force only one dag
> > run
> > > at
> > > > > the
> > > > > > > time.
> > > > > > >
> > > > > > > On Fri, Mar 17, 2017 at 7:33 PM, siddharth anand <
> > > san...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > With the Apache Airflow 1.8 release imminent, you may want to
> > try
> >

Re: 1.8.1 release

2017-03-22 Thread Ruslan Dautkhanov
Thank you Sid!


Best regards,
Ruslan

On Wed, Mar 22, 2017 at 12:01 AM, siddharth anand  wrote:

> Ruslan,
> Thanks for sharing this list. I can pick a few up. I agree we should aim to
> get some of them into 1.8.1.
>
> -s
>
> On Tue, Mar 21, 2017 at 2:29 PM, Ruslan Dautkhanov 
> wrote:
>
> > Some of the issues I ran into while testing 1.8rc5 :
> >
> > https://issues.apache.org/jira/browse/AIRFLOW-1015
> > > https://issues.apache.org/jira/browse/AIRFLOW-1013
> > > https://issues.apache.org/jira/browse/AIRFLOW-1004
> > > https://issues.apache.org/jira/browse/AIRFLOW-1003
> > > https://issues.apache.org/jira/browse/AIRFLOW-1001
> > > https://issues.apache.org/jira/browse/AIRFLOW-1015
> >
> >
> > It would be great to have at least some of them fixed in 1.8.1.
> >
> > Thank you.
> >
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Tue, Mar 21, 2017 at 3:02 PM, Dan Davydov  > invalid
> > > wrote:
> >
> > > Here is my list for targeted 1.8.1 fixes:
> > > https://issues.apache.org/jira/browse/AIRFLOW-982
> > > https://issues.apache.org/jira/browse/AIRFLOW-983
> > > https://issues.apache.org/jira/browse/AIRFLOW-1019 (and in general the
> > > slow
> > > startup time from this new logic of orphaned/reset task)
> > > https://issues.apache.org/jira/browse/AIRFLOW-1017 (which I will
> > hopefully
> > > have a fix out for soon just finishing up tests)
> > >
> > > We are also hitting a new issue with subdags with rc5 that we weren't
> > > hitting with rc4 where subdags will occasionally just hang (had to roll
> > > back from rc5 to rc4), I'll try to spin up a JIRA for it soon which
> > should
> > > be on the list too.
> > >
> > >
> > > On Tue, Mar 21, 2017 at 1:54 PM, Chris Riccomini <
> criccom...@apache.org>
> > > wrote:
> > >
> > > > Agreed. I'm looking for a list of checksums/JIRAs that we want in the
> > > > bugfix release.
> > > >
> > > > On Tue, Mar 21, 2017 at 12:54 PM, Bolke de Bruin 
> > > > wrote:
> > > >
> > > > >
> > > > >
> > > > > > On 21 Mar 2017, at 12:51, Bolke de Bruin 
> > wrote:
> > > > > >
> > > > > > My suggestion, as we are using semantic versioning is:
> > > > > >
> > > > > > 1) no new features in the 1.8 branch
> > > > > > 2) only bug fixes in the 1.8 branch
> > > > > > 3) new features to land in 1.9
> > > > > >
> > > > > > This allows companies to
> > > > >
> > > > > Have a "known" version and can move to the new branch when they
> want
> > to
> > > > > get new features. Obviously we only support N-1, so when 1.10 comes
> > out
> > > > we
> > > > > stop supporting 1.8.X.
> > > > >
> > > > > >
> > > > > > Sent from my iPhone
> > > > > >
> > > > > >> On 21 Mar 2017, at 11:22, Chris Riccomini <
> criccom...@apache.org>
> > > > > wrote:
> > > > > >>
> > > > > >> Hey all,
> > > > > >>
> > > > > >> I suggest that we start a 1.8.1 Airflow release now. The goal
> > would
> > > > be:
> > > > > >>
> > > > > >> 1) get a second release under our belt
> > > > > >> 2) patch known issues with the 1.8.0 release
> > > > > >>
> > > > > >> I'm happy to run it, but I saw Maxime mentioning that Airbnb
> might
> > > > want
> > > > > to.
> > > > > >> @Max et al, can you comment?
> > > > > >>
> > > > > >> Also, can folks supply JIRAs for stuff that think needs to be in
> > the
> > > > > 1.8.1
> > > > > >> bugfix release?
> > > > > >>
> > > > > >> Cheers,
> > > > > >> Chris
> > > > >
> > > >
> > >
> >
>


Re: Upload DAG / owner / multi-tenancy

2017-03-31 Thread Ruslan Dautkhanov
It would be interesting to see how the Kerberos setup will be implemented in
multi-tenant Airflow. Would it be just one keytab for a "service account",
with authentication then impersonating the actual users? A similar approach
is taken by some other projects, I believe, for example Livy, Hue and some
others.


On Thu, Mar 30, 2017 at 7:03 PM Bolke de Bruin  wrote:

> No not yet. It is planned for maybe 1.9 or 1.10. You can do this with git
> or any other artifact manager though.
>
> B.
>
> Sent from my iPhone
>
> > On 30 Mar 2017, at 16:36, Thomas Weise  wrote:
> >
> > Hi,
> >
> > Dropping files into dags_folder is a convenient way to deploy new DAGs,
> but
> > this way they are owned by the respective OS user. I'm looking for a
> > multi-tenant setup with multiple "owners", where each can add new DAGs
> and
> > see only their DAGs in the UI.
> >
> > Is this supported?
> >
> > Thanks,
> > Thomas
>


Re: 1.8.1 release

2017-04-05 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/AIRFLOW-1033

Is it also tagged "Airflow 1.8" and not "1.8.0", and a blocker?

Thanks.



-- 
Ruslan Dautkhanov

On Mon, Apr 3, 2017 at 3:20 PM, Bolke de Bruin  wrote:

> Hi Chris,
>
> Please add:
>
> https://issues.apache.org/jira/browse/AIRFLOW-1011 <
> https://issues.apache.org/jira/browse/AIRFLOW-1011> (PR:
> https://github.com/apache/incubator-airflow/pull/2179 <
> https://github.com/apache/incubator-airflow/pull/2179>)
>
> To the blocker list. It was missed due to the “Airflow 1.8” affected
> version tag instead of “1.8.0”. I fixed that for you.
>
> Bolke.
>
> > On 3 Apr 2017, at 22:06, Bolke de Bruin  wrote:
> >
> > I have a PR out for
> >
> > * AIRFLOW-1001: https://github.com/apache/incubator-airflow/pull/2213 <
> https://github.com/apache/incubator-airflow/pull/2213>
> > * AIRFLOW-1018: https://github.com/apache/incubator-airflow/pull/2212 <
> https://github.com/apache/incubator-airflow/pull/2212>
> >
> > Furthermore, Sid, I checked AIRFLOW-1053 and while it is annoying I
> don’t think it is a blocker: it happens only with @once dags that have a
> SLA, hardly very common. Nevertheless a fix would be nice obviously.
> >
> > Bolke
> >
> >> On 3 Apr 2017, at 11:05, Bolke de Bruin  bdbr...@gmail.com>> wrote:
> >>
> >> Hey Chris,
> >>
> >> AIRFLOW-1000 has a PR out. You might want to discuss it on the list
> what the impact is though.
> >> AIRFLOW-1018 I’ll have a PR out for shortly that allows “stdout” for
> the scheduler log files. I did downgrade from blocker to critical.
> >> AIRFLOW-719 Has a PR, including much needed increased test coverage,
> that I am pretty sure is working, but needs verification (plz @Alex)
> >>
> >> I would downgrade AIRFLOW-1019 to critical - especially as Dan is not
> around at the moment.
> >>
> >> Furthermore, do you have any timelines for getting to the release? Can
> I help in any way? You might want to chase a couple of times ;-).
> >>
> >> Bolke
> >>
> >>> On 30 Mar 2017, at 19:48, siddharth anand  san...@apache.org>> wrote:
> >>>
> >>> Chris,
> >>> I've submitted PRs for :
> >>>
> >>>  - PR [AIRFLOW-1013] :
> >>>  https://github.com/apache/incubator-airflow/pull/2203 <
> https://github.com/apache/incubator-airflow/pull/2203>
> >>>  - PR [AIRFLOW-1054]:
> >>>  https://github.com/apache/incubator-airflow/pull/2201 <
> https://github.com/apache/incubator-airflow/pull/2201>
> >>>
> >>> And filed a blocker for a new issue. Essentially, @once DAGs cannot be
> >>> created if catchup=False :
> >>> https://issues.apache.org/jira/browse/AIRFLOW-1055 <
> https://issues.apache.org/jira/browse/AIRFLOW-1055>
> >>>
> >>> I have a PR that works for this, but will need to add unit tests for
> it as
> >>> well as for AIRFLOW-1013.
> >>>
> >>> -s
> >>>
> >>> On Wed, Mar 29, 2017 at 3:24 PM, siddharth anand 
> wrote:
> >>>
> >>>> Didn't realize https://issues.apache.org/jira/browse/AIRFLOW-1013
> was a
> >>>> blocker. I will have a PR shortly.
> >>>> -s
> >>>>
> >>>> On Wed, Mar 29, 2017 at 2:07 PM, Chris Riccomini <
> criccom...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> The following four JIRAs were not merged into the v1-8-test branch, but
> >>>>> are listed as part of the 1.8.1 release:
> >>>>>
> >>>>> AIRFLOW-1017 b2b9587cca9195229ab107394ad94b7702c70e37
> >>>>> AIRFLOW-906 bc47200711be4d2c0b36b772651dae4f5e01a204
> >>>>> AIRFLOW-858 94dc7fb0a6bb3c563d9df6566cd52a59bd0c4629
> >>>>> AIRFLOW-832 b0ae70d3a8e935dc9266b6853683ae5375a7390b
> >>>>>
> >>>>> I'm going to merge them in now.
> >>>>>
> >>>>> On Wed, Mar 29, 2017 at 1:53 PM, Chris Riccomini <
> criccom...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey Bolke,
> >>>>>>
> >>>>>> Great. Assuming your PR is committed, that leaves five blockers:
> >>>>>>
> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1000
> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-1001
> >>>>>> https://issues.apache.org/jira/browse/AIRFLOW-101

Re: Help needed in building airflow

2017-04-10 Thread Ruslan Dautkhanov
Sachin,

Do you just need to install Airflow?
Check https://airflow.incubator.apache.org/installation.html

If it has to be from GitHub, it can be something like
pip install -e "git+https://github.com/apache/incubator-airflow.git#egg=airflow[crypto,jdbc,hdfs,hive,kerberos,ldap,password]"
(pip's editable VCS installs need the git+ scheme prefix, and the quotes keep
the shell from globbing the extras brackets)




-- 
Ruslan Dautkhanov

On Sun, Apr 9, 2017 at 10:17 PM, Sachin  wrote:

> Hi,
>
> I am new to airflow and trying to build it. I checked out the coed from git
> and need help in building it.
>
> Can someone please send the steps to build airflow code?
>
> Thanks,
> Sachin
>


regression from a month-old airflow version

2017-05-06 Thread Ruslan Dautkhanov
I've upgraded Airflow to today's master branch.

Got following regression in attempt to start a DAG:

Process DagFileProcessor209-Process:
> Traceback (most recent call last):
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py",
> line 258, in _bootstrap
> self.run()
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py",
> line 114, in run
> self._target(*self._args, **self._kwargs)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line
> 346, in helper
> pickle_dags)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py",
> line 48, in wrapper
> result = func(*args, **kwargs)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line
> 1584, in process_file
> self._process_dags(dagbag, dags, ti_keys_to_schedule)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line
> 1173, in _process_dags
> dag_run = self.create_dag_run(dag)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py",
> line 48, in wrapper
> result = func(*args, **kwargs)
>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line
> 776, in create_dag_run
> if next_start <= now:
> TypeError: can't compare datetime.datetime to NoneType
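
My guess from the traceback: dag.following_schedule() presumably returns None
for an '@once' schedule_interval, so the `next_start <= now` comparison in
create_dag_run() blows up. A None-guard of roughly this shape would avoid the
TypeError (a sketch of the idea, not the actual scheduler code):

next_start = dag.following_schedule(last_start)  # None for '@once'?
if next_start is not None and next_start <= now:
    pass  # only then is it safe to create the next DAG run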



DAG definition:

> main_dag = DAG(
>     dag_id              = 'DISCOVER-Oracle-Load-Mar2017-v1',
>     default_args        = default_args,       # default operators' arguments - see above
>     user_defined_macros = dag_macros,         # I do not get the difference between
>     ## params           = dag_macros,         #   user_defined_macros and params
>     start_date          = datetime.now(),     # or e.g. datetime(2015, 6, 1)
>     # 'end_date'        = datetime(2016, 1, 1),
>     catchup             = False,              # Perform scheduler catchup (or only run latest)?
>                                               #   - defaults to True
>     schedule_interval   = '@once',            # '@once'=None?
>                                               #   doesn't create multiple dag runs automatically
>     concurrency         = 3,                  # task instances allowed to run concurrently
>     max_active_runs     = 1,                  # only one DAG run at a time
>     dagrun_timeout      = timedelta(days=4),  # no way this dag should run for 4 days
>     orientation         = 'TB',               # default graph view
> )


default_args:

> default_args = {
>     # Security:
>     'owner'             : 'rdautkha',         # owner of the task, using the unix username is recommended
>     # 'run_as_user'     : None,               # unix username to impersonate while running the task
>     # Scheduling:
>     'start_date'        : None,               # don't confuse with DAG's start_date
>     'depends_on_past'   : False,              # True makes sense... but there are bugs around that code
>     'wait_for_downstream' : False,            # depends_on_past is forced to True if wait_for_downstream
>     'trigger_rule'      : 'all_success',      # all_success is default anyway
>     # Retries:
>     'retries'           : 0,                  # No retries
>     # 'retry_delay'     : timedelta(minutes=5),  # check retry_exponential_backoff and max_retry_delay too
>     # Timeouts and SLAs:
>     # 'sla'             : timedelta(hours=1), # default tasks' sla - normally don't run longer
>     'execution_timeout' : timedelta(hours=3), # no single task runs 3 hours or more
>     # 'sla_miss_callback'                     # - function to call when reporting SLA timeouts
>     # Notifications:
>     'email'             : ['rdautkha...@epsilon.com'],
>     'email_on_failure'  : True,
>     'email_on_retry'    : True,
>     # Resource usage:
>     'pool'              : 'DISCOVER-Prod',    # can increase this pool's concurrency
>     # 'queue'           : 'some_queue',
>     # 'priority_weight' : 10,
>     # Miscellaneous:
>     # on_failure_callback=None, on_success_callback=None, on_retry_callback=None
> }


The DAG itself has a bunch of Oracle operators.

Any ideas?

That's a regression from a month old Airflow.
No changes in DAG.



Thank you,
Ruslan Dautkhanov


Re: regression from a month-old airflow version

2017-05-06 Thread Ruslan Dautkhanov
Thanks for the follow up Chris.
It used to work for me with catchup=False in a month-old version of
Airflow. That's why I mentioned it as a regression.

Tried catchup=True today: with @once it actually tries to "catch up", which
does not make sense for an @once schedule;
notice there is one active run and one pending/"scheduled":
   [image: Inline image 1]

So we can't really use @once with catchup=True and it's not a workaround
for this problem.

Thanks.



-- 
Ruslan Dautkhanov

On Sat, May 6, 2017 at 10:47 AM, Chris Fei  wrote:

> I wonder if your issue is the same root cause as AIRFLOW-1013[1] (which
> you seem to have reported) and AIRFLOW-1055[2]. I haven't tried it
> myself, but that second ticket seems to indicate that a workaround
> could be setting catchup = True on your DAG. Not sure if that's an
> option for you.
> On Sat, May 6, 2017, at 12:29 PM, Ruslan Dautkhanov wrote:
> > I've upgraded Airflow to today's master branch.
> >
> > Got following regression in attempt to start a DAG:
> >
> > Process DagFileProcessor209-Process:
> >> Traceback (most recent call last):
> >>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
> >>     self.run()
> >>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
> >>     self._target(*self._args, **self._kwargs)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 346, in helper
> >>     pickle_dags)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py", line 48, in wrapper
> >>     result = func(*args, **kwargs)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 1584, in process_file
> >>     self._process_dags(dagbag, dags, ti_keys_to_schedule)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 1173, in _process_dags
> >>     dag_run = self.create_dag_run(dag)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py", line 48, in wrapper
> >>     result = func(*args, **kwargs)
> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 776, in create_dag_run
> >>     if next_start <= now:
> >> TypeError: can't compare datetime.datetime to NoneType
> >
> >
> >
> > DAG definition:
> >
> > main_dag = DAG(
> >     dag_id              = 'DISCOVER-Oracle-Load-Mar2017-v1',
> >     default_args        = default_args,       # default operators' arguments - see above
> >     user_defined_macros = dag_macros,         # I do not get the difference between
> >     ## params           = dag_macros,         #   user_defined_macros and params
> >     start_date          = datetime.now(),     # or e.g. datetime(2015, 6, 1)
> >     # 'end_date'        = datetime(2016, 1, 1),
> >     catchup             = False,              # Perform scheduler catchup (or only run latest)?
> >                                               #   - defaults to True
> >     schedule_interval   = '@once',            # '@once'=None?
> >                                               #   doesn't create multiple dag runs automatically
> >     concurrency         = 3,                  # task instances allowed to run concurrently
> >     max_active_runs     = 1,                  # only one DAG run at a time
> >     dagrun_timeout      = timedelta(days=4),  # no way this dag should run for 4 days
> >     orientation         = 'TB',               # default graph view
> > )
> >
> >
> > default_args:
> >
> > default_args = {
> >     # Security:
> >     'owner'         : 'rdautkha',  # owner of the task, using the unix username is recommended
> >     # 'run_as_user' : None,        # unix username to impersonate while running the task
> >     # Scheduling:
> >

Re: regression from a month-old airflow version

2017-05-07 Thread Ruslan Dautkhanov
Filed https://issues.apache.org/jira/browse/AIRFLOW-1178 for @once being
scheduled twice.




-- 
Ruslan Dautkhanov

On Sat, May 6, 2017 at 9:30 PM, Ruslan Dautkhanov 
wrote:

> Thanks for the follow up Chris.
> It used to work for me with catchup=False in a month-old version of
> Airflow. That's why I mentioned it as a regression.
>
> Tried catchup=True today: with @once it actually tries to "catch up", which
> does not make sense for an @once schedule;
> notice there is one active run and one pending/"scheduled":
>[image: Inline image 1]
>
> So we can't really use @once with catchup=True and it's not a workaround
> for this problem.
>
> Thanks.
>
>
>
> --
> Ruslan Dautkhanov
>
> On Sat, May 6, 2017 at 10:47 AM, Chris Fei  wrote:
>
>> I wonder if your issue is the same root cause as AIRFLOW-1013[1] (which
>> you seem to have reported) and AIRFLOW-1055[2]. I haven't tried it
>> myself, but that second ticket seems to indicate that a workaround
>> could be setting catchup = True on your DAG. Not sure if that's an
>> option for you.
>> On Sat, May 6, 2017, at 12:29 PM, Ruslan Dautkhanov wrote:
>> > I've upgraded Airflow to today's master branch.
>> >
>> > Got following regression in attempt to start a DAG:
>> >
>> > Process DagFileProcessor209-Process:
>> >> Traceback (most recent call last):
>> >>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
>> >>     self.run()
>> >>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
>> >>     self._target(*self._args, **self._kwargs)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 346, in helper
>> >>     pickle_dags)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py", line 48, in wrapper
>> >>     result = func(*args, **kwargs)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 1584, in process_file
>> >>     self._process_dags(dagbag, dags, ti_keys_to_schedule)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 1173, in _process_dags
>> >>     dag_run = self.create_dag_run(dag)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/utils/db.py", line 48, in wrapper
>> >>     result = func(*args, **kwargs)
>> >>   File "/opt/airflow/airflow-20170506/src/airflow/airflow/jobs.py", line 776, in create_dag_run
>> >>     if next_start <= now:
>> >> TypeError: can't compare datetime.datetime to NoneType
>> >
>> >
>> >
>> > DAG definition:
>> >
>> > main_dag = DAG(
>> >     dag_id              = 'DISCOVER-Oracle-Load-Mar2017-v1',
>> >     default_args        = default_args,       # default operators' arguments - see above
>> >     user_defined_macros = dag_macros,         # I do not get the difference between
>> >     ## params           = dag_macros,         #   user_defined_macros and params
>> >     start_date          = datetime.now(),     # or e.g. datetime(2015, 6, 1)
>> >     # 'end_date'        = datetime(2016, 1, 1),
>> >     catchup             = False,              # Perform scheduler catchup (or only run latest)?
>> >                                               #   - defaults to True
>> >     schedule_interval   = '@once',            # '@once'=None?
>> >                                               #   doesn't create multiple dag runs automatically
>> >     concurrency         = 3,                  # task instances allowed to run concurrently
>> >     max_active_runs     = 1,                  # only one DAG run at a time
>> >     dagrun_timeout      = timedelta(days=4),  # no way this dag should run for 4 days
>> >     orientation         = 'TB',               # default graph view
>> > )

Re: Airflow on windows

2017-05-16 Thread Ruslan Dautkhanov
You could try to run gunicorn and other Python modules, including Airflow,
in a Cygwin environment on Windows [1]
(not tested, so it may or may not work after some tweaking).

https://www.cygwin.com/

[1]


> rdautkhanov@CO-LT53378 /usr/bin
> $ /usr/bin/python
> Python 2.7.13 (default, Mar 13 2017, 20:56:15)
> [GCC 5.4.0] on cygwin
> >>> import gunicorn
> >>>
> >>>
>
> rdautkhanov@CO-LT53378 /usr/bin
> $ uname -a
> CYGWIN_NT-6.1-WOW CO-LT53378 2.2.1(0.289/5/3) 2015-08-20 11:40 i686 Cygwin
>
> rdautkhanov@CO-LT53378 /usr/bin
> $


Fwd: Integrating with Airflow

2017-05-30 Thread Ruslan Dautkhanov
Forwarding from Zeppelin user group.

Thought might be interesting for Airflow users too, how others are
integrating Airflow and Zeppelin.



-- Forwarded message --
From: Erik Schuchmann 
Date: Tue, May 30, 2017 at 8:53 AM
Subject: Re: Integrating with Airflow
To: us...@zeppelin.apache.org


We have begun experimenting with an airflow/zeppelin integration.  We use
the first paragraph of a note to define dependencies and outputs; names and
owners; and schedule for the note. There are utility functions (in scala)
available that provide a data catalog for retrieving data sources.  These
functions return a dataframe and record that note's dependency on a
particular data source so that a dag can be constructed between the task
that creates the input data and the zeppelin note.  There is also a
function to provide a dataframe writer that captures the outputs that a
note provides.  This registers that note as the source for data that is
then available in the data catalog for other notebooks to use.  This allows
one notebook to have a dependency on data created by another notebook.

An airflow dag generator (python code) queries the zeppelin notebook
server, looking for the results of the first paragraph for each note.  It
uses these outputs to construct the DAG between notebooks.  It generates a
ZeppelinNoteOperator for each note that will use the zeppelin REST api to
execute the notebook when the scheduler schedules that task.
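
A bare-bones sketch of what such an operator could look like - the run-note
endpoint is from the Zeppelin notebook REST API, and the rest is illustrative
rather than our actual code:

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
import requests

class ZeppelinNoteOperator(BaseOperator):
    @apply_defaults
    def __init__(self, note_id, zeppelin_url='http://zeppelin:8080',
                 *args, **kwargs):
        super(ZeppelinNoteOperator, self).__init__(*args, **kwargs)
        self.note_id = note_id
        self.zeppelin_url = zeppelin_url

    def execute(self, context):
        # POST /api/notebook/job/{noteId} asks Zeppelin to run all
        # paragraphs of the note
        resp = requests.post('%s/api/notebook/job/%s'
                             % (self.zeppelin_url, self.note_id))
        resp.raise_for_status()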

We've just started to use this so we don't have a lot of experience with it
yet.  The biggest caveats to start are:

* there is no mechanism for test-cases of note code
* We have to call the notebook server on every iteration of the
scheduler/whenever a dag is init'd - we use the cached results if
available, but it still requires a round trip to the zeppelin notebook
server

Regards,
Erik


On Fri, May 19, 2017 at 10:15 PM Ben Vogan  wrote:

> Thanks for sharing this Ruslan - I will take a look.
>
> I agree that paragraphs can form tasks within a DAG.  My point was that
> ideally a DAG could encompass multiple notes.  I.e. the completion of one
> note triggers another and so on to complete an entire chain of dependent
> tasks.
>
> For example team A has a note that generates data set A*.  Teams B & C
> each have notes that depend on A* to generate B* & C* for their specific
> purposes.  It doesn't make sense for all of that to have to live in one
> note, but they are all part of a single workflow.
>
> Best,
> --Ben
>
> On Fri, May 19, 2017 at 9:02 PM, Ruslan Dautkhanov 
> wrote:
>
>> Thanks for sharing this Ben.
>>
>> I agree Zeppelin is a better fit with tighter integration with Spark and
>> built-in visualizations.
>>
>> We have pretty much standardized on pySpark, so here's one of the scripts
>> we use internally
>> to extract %pyspark, %sql and %md paragraphs into a standalone script
>> (that can be scheduled in Airflow for example)
>> https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome
>> :-)
>>
>> Hope this helps.
>>
>> ps. In my opinion adding dependencies between paragraphs wouldn't be that
>> hard for simple cases,
>> and could be a first step to defining a DAG in Zeppelin directly. It would be
>> really awesome to see this type of
>> integration in the future.
>>
>> Otherwise I don't see much value if a whole note / whole workflow would run
>> as a single task in Airflow.
>> In my opinion, each paragraph has to be a task... then it'll be very
>> useful.
>>
>>
>> Thanks,
>> Ruslan
>>
>>
>> On Fri, May 19, 2017 at 4:55 PM, Ben Vogan  wrote:
>>
>>> I do not expect the relationship between DAGs to be described in
>>> Zeppelin - that would be done in Airflow.  It just seems that Zeppelin is
>>> such a great tool for a data scientists workflow that it would be nice if
>>> once they are done with the work the note could be productionized
>>> directly.  I could envision a couple of scenarios:
>>>
>>> 1. Using a zeppelin instance to run the note via the REST API.  The
>>> instance could be containerized and spun up specifically for a DAG or it
>>> could be a permanently available one.
>>> 2. A note could be pulled from git and some part of the Zeppelin engine
>>> could execute the note without the web UI at all.
>>>
>>> I would expect on the airflow side there to be some special operators
>>> for executing these.
>>>
>>> If the scheduler is pluggable then it should be possible to create a
>>> plug-in that talks to the Airflow REST API.
>>>
>>> I happen to prefer Zeppelin to Jupyter - although 

Re: Release Manager for 1.8.2?

2017-06-08 Thread Ruslan Dautkhanov
It would be great if somebody would have a look at following 3 jiras

I've flagged
https://issues.apache.org/jira/browse/AIRFLOW-1013
https://issues.apache.org/jira/browse/AIRFLOW-1178
https://issues.apache.org/jira/browse/AIRFLOW-1055

Two of them were Blockers for one of the previous 1.8 releases but were
de-escalated because there were no resources to fix them.

I'm currently Assignee on AIRFLOW-1055 - can't unassign myself - feel free
to scratch my name there.




-- 
Ruslan Dautkhanov

On Fri, Jun 9, 2017 at 12:23 AM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> Cool. I'll let everyone know when RC1 is out.
>
> @community, if you think anything else should be a blocker for 1.8.2 please
> flag it in Jira very soon or it will miss the release train!
>
> Max
>
> On Thu, Jun 8, 2017 at 9:06 AM, Chris Riccomini 
> wrote:
>
> > Thanks for doing all this, Max! When you're ready, I can deploy to some
> of
> > our environments to verify. Just let me know.
> >
> > On Thu, Jun 8, 2017 at 8:42 AM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > I branched off `v1-8-test` so all should be good. I just didn't know
> if I
> > > could move ahead with that branch just yet so I branched off. I just
> got
> > > back on `v1-8-test` and pushed what I have to Apache [somehow I had to
> > > rebase meaning someone added something over the past 24 hours].
> > >
> > > I just set `Fix Version` of AIRFLOW-1294 to 1.8.2 and set it to
> blocker.
> > >
> > > I cherry-picked all the commits targeting 1.8.2 I could without getting
> > > into conflicts. I'd only work at resolving conflicts on blockers. My
> plan
> > > is to add only the 2 rows in red for RC1.
> > >
> > > Here's the current output of `airflow-jira 1.8.2`:
> > >
> > > ISSUE ID    |TYPE|PRIORITY|STATUS     |DESCRIPTION                                       |MERGED|PR   |COMMIT
> > > AIRFLOW-1294|Bug |Blocker |Open       |Backfills can loose tasks to execute due to tasks |0     |-    |-
> > > AIRFLOW-1291|Bug |Blocker |Open       |Update NOTICE and LICENSE files to meet ASF specs |0     |-    |-
> > > AIRFLOW-1290|Bug |Major   |Resolved   |Change docs author from "Maxime Beauchemin" to "Ap|1     |#na  |fc33c040605e7b0121739af841790336bdfa3948
> > > AIRFLOW-1282|Bug |Major   |Resolved   |Fix known event column sorting                    |1     |#2350|2e9eee3697f6261de61c77ba584ace14b22751f0
> > > AIRFLOW-1281|Bug |Minor   |Resolved   |Sort variables by key field by default            |1     |#2347|57d5bcdad4202fd8a94be07b4aeacee065801481
> > > AIRFLOW-1275|Bug |Major   |In Progress|Fix `airflow pool` command exception              |0     |-    |-
> > > AIRFLOW-1274|Bug |Major   |Open       |HttpSensor parameter params is overriding BaseOper|0     |-    |-
> > > AIRFLOW-1266|Bug |Major   |Open       |Long task names are truncated in gannt view       |1     |#2345|7f33f6e2fe4149e547860fdb237da26a5b7023ec
> > > AIRFLOW-1263|Bug |Major   |Resolved   |Airflow Charts Pages Should Have Dynamic Heights  |1     |#2344|acd0166d8c7f5639898791848d985f6441b3b5b5
> > > AIRFLOW-1244|Bug |Major   |Resolved   |Forbid creation of a pool with empty name         |1     |#2324|802fc1549d0f173c646794f2476156d34383adc0
> > > AIRFLOW-1243|Bug |Minor   |Resolved   |DAGs table has no default entries to show         |1     |#2323|1232b6a4fd691e0faf081d4cf792a6d96d33b595
> > > AIRFLOW-1227|Bug |Minor   |Resolved   |Remove empty column on the Logs view              |1     |#2310|b0ba3c91044b2abd92289d9b3f7d913060ef6c31
> > > AIRFLOW-1226|Bug |Minor   |Resolved   |Remove empty column on the Jobs view              |1     |#2309|c406652bd4c8f8099841c69f2d1ccc9678d77bfd
> > > AIRFLOW-1221|Bug |Major   |Resolved   |Fix DatabricksSubmitRunOperator Templating        |0     |-    |-
> > > AIRFLOW-1217|Bug |Major   |Resolved   |Enable logging in Sqoop hook                      |0     |-    |-
> > > AIRFLOW-1213|Bug |

Re: Airflow profiling

2017-06-27 Thread Ruslan Dautkhanov
Maybe not as powerful as stackImpact, another nice option:
https://www.jetbrains.com/help/pycharm/profiler.html



-- 
Ruslan Dautkhanov

On Tue, Jun 27, 2017 at 2:40 PM, Alex Guziel  wrote:

> Yeah, actually we have set up Newrelic for Airflow too at Airbnb, which
> gives decent insights into webserver perf. In terms of SQL queries, adding
> `echo=True` to the SQLAlchemy engine creation is pretty good for seeing
> which sql queries get created. I tried some Python profilers before but
> they weren't super helpful.
>
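
A minimal sketch of that echo=True idea (the connection URI is an
assumption for illustration):

    from sqlalchemy import create_engine

    # echo=True makes SQLAlchemy log every statement it emits, showing
    # exactly which SQL queries get created.
    engine = create_engine('postgresql://airflow:airflow@localhost/airflow',
                           echo=True)
    engine.execute('SELECT 1')  # the statement and parameters are logged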
> On Tue, Jun 27, 2017 at 1:27 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > Nice. It would be great if DAG parsing were faster; some of the
> > endpoints on the website have grown really slow as we've grown the
> > number of DAGs, and on DAGs with a large number of tasks.
> >
> > I had the intuition that DAG parsing could be faster if operators
> > late-imported hooks (which themselves import external libs) but I have no
> > evidence or test to support it.
> >
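
A hedged sketch of that late-import idea (the operator and connection
names are assumptions):

    from airflow.models import BaseOperator

    class LazyImportOperator(BaseOperator):
        """Defers importing the hook (and its external libs) to runtime."""

        def execute(self, context):
            # Imported only when the task runs on a worker, so merely
            # parsing the DAG file does not pay the import cost.
            from airflow.hooks.mysql_hook import MySqlHook
            MySqlHook(mysql_conn_id='mysql_default').run('SELECT 1')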
> > I'm sure there's tons of low hanging fruit and this type of tool should
> > make it really clear.
> >
> > We've set up NewRelic (which seems similar to this tooling at first
> > sight)
> > for Superset at Airbnb and it gave us great insight.
> >
> > Max
> >
> > On Tue, Jun 27, 2017 at 1:01 PM, Bolke de Bruin 
> wrote:
> >
> > > Free version also there, maybe more integration testing and
> benchmarking.
> > >
> > > https://stackimpact.com/pricing/ <https://stackimpact.com/pricing/>
> > >
> > > B.
> > >
> > > > On 27 Jun 2017, at 22:00, Chris Riccomini 
> > wrote:
> > > >
> > > > Seems you have to pay?
> > > >
> > > > On Tue, Jun 27, 2017 at 12:56 PM, Bolke de Bruin 
> > > wrote:
> > > >
> > > >> Just saw this tool on hacker news:
> > > >>
> > > >> https://github.com/stackimpact/stackimpact-python <
> > https://github.com/
> > > >> stackimpact/stackimpact-python>
> > > >>
> > > >> Might be interesting for some profiling.
> > > >>
> > > >> Bolke
> > >
> > >
> >
>


Re: [VOTE] Release Airflow 1.8.2 based on Airflow 1.8.2 RC4

2017-08-16 Thread Ruslan Dautkhanov
+1
On Wed, Aug 16, 2017 at 3:33 PM Bolke de Bruin  wrote:

> Max
>
> Can you move this forward please? You have received 2 binding +1 if you
> include your own it's 3 and the vote has been open long enough. Time to go
> to the ipmc!
>
> Cheers
> Bolke
>
> Sent from my iPhone
>
> > On 9 Aug 2017, at 20:06, Bolke de Bruin  wrote:
> >
> > We will maintain 1.8.X and 1.9.X until a next major release. So there
> will be a 1.8.3 regardless of 1.9.
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> >> On 9 Aug 2017, at 19:07, George Leslie-Waksman 
> >> 
> wrote:
> >>
> >> We've been having a lot of problems with 1.8.1 because of a bug in the
> >> DagRun deadlock detection, reported under AIRFLOW-1420 and AIRFLOW-1473.
> >>
> >> I have a bugfix PR out:
> >> https://github.com/apache/incubator-airflow/pull/2506
> >>
> >> I would love to see this bugfix make it in before work starts on new
> >> features for 1.9.0. If it can make it in to 1.8.2, great; if not, I
> would
> >> hope that there can be a 1.8.3 with it sooner rather than later.
> >>
> >> --George
> >>
> >> On Wed, Aug 9, 2017 at 8:58 AM Maxime Beauchemin <
> maximebeauche...@gmail.com>
> >> wrote:
> >>
> >>> I still consider 1.8.2 a bit of a practice run for me, once it's out
> I'll
> >>> kick off 1.9.0 fresh forked off master.
> >>>
> >>> Max
> >>>
> >>> On Wed, Aug 9, 2017 at 8:57 AM, Maxime Beauchemin <
> >>> maximebeauche...@gmail.com> wrote:
> >>>
>  Oh I need to simplify the instructions and point to using that.
> 
>  Max
> 
>  On Wed, Aug 9, 2017 at 8:51 AM, Bolke de Bruin 
> >>> wrote:
> 
> > +1 (binding),
> >
> > You know that there is a sign.sh in the dev folder? Makes your life
> > easier :-).
> >
> >
> > Sent from my iPhone
> >
> >> On 8 Aug 2017, at 19:37, Chris Riccomini 
> >>> wrote:
> >>
> >> Gotcha.
> >>
> >> +1 (binding)
> >>
> >> Validated artifacts:
> >>
> >> $ gpg --print-md SHA512 apache-airflow-1.8.2+incubating-bin.tar.gz
> >> apache-airflow-1.8.2+incubating-source.tar.gz
> >> apache-airflow-1.8.2+incubating-bin.tar.gz:
> >> F32FFE95 BDC2066A 125624F1 62731E34 AD45D5A4 01B73345 FDD3FF09
> >>> FFD9CEF8
> >> 632611DF
> >> B943FA79 9074B8AA 1B54616F EB47A3DB B35D740A 1DBB907D 32E83F59
> >> apache-airflow-1.8.2+incubating-source.tar.gz:
> >> 51381ED4 7FA0C0DE 5822BA23 B9EC7290 EA668259 6CD03800 B176E9A6
> >>> 24F27955
> >> 8CC0B1A8
> >> 7DC89DF8 FF58A32F 20B5C448 FC7EECA5 17E0B749 6B356B42 05E020AB
> >>
> >> $ cat apache-airflow-1.8.2+incubating-bin.tar.gz.sha
> >> apache-airflow-1.8.2+incubating-source.tar.gz.sha
> >> apache-airflow-1.8.2+incubating-bin.tar.gz:
> >> F32FFE95 BDC2066A 125624F1 62731E34 AD45D5A4 01B73345 FDD3FF09
> >>> FFD9CEF8
> >> 632611DF
> >> B943FA79 9074B8AA 1B54616F EB47A3DB B35D740A 1DBB907D 32E83F59
> >> apache-airflow-1.8.2+incubating-source.tar.gz:
> >> 51381ED4 7FA0C0DE 5822BA23 B9EC7290 EA668259 6CD03800 B176E9A6
> >>> 24F27955
> >> 8CC0B1A8
> >> 7DC89DF8 FF58A32F 20B5C448 FC7EECA5 17E0B749 6B356B42 05E020AB
> >>
> >> $ md5 apache-airflow-1.8.2+incubating-bin.tar.gz
> >> apache-airflow-1.8.2+incubating-source.tar.gz
> >> MD5 (apache-airflow-1.8.2+incubating-bin.tar.gz) =
> >> 7e57eda714847f0057f3e31daf90a3d6
> >> MD5 (apache-airflow-1.8.2+incubating-source.tar.gz) =
> >> 62d371c2e828f6631e8d8646f09bf593
> >>
> >> $ cat apache-airflow-1.8.2+incubating-bin.tar.gz.md5
> >> apache-airflow-1.8.2+incubating-source.tar.gz.md5
> >> apache-airflow-1.8.2+incubating-bin.tar.gz:
> >> 7E 57 ED A7 14 84 7F 00  57 F3 E3 1D AF 90 A3 D6
> >> apache-airflow-1.8.2+incubating-source.tar.gz:
> >> 62 D3 71 C2 E8 28 F6 63  1E 8D 86 46 F0 9B F5 93
> >>
> >> $ gpg --verify  apache-airflow-1.8.2+incubating-bin.tar.gz.asc
> >> apache-airflow-1.8.2+incubating-bin.tar.gz
> >> gpg: Signature made Mon Aug  7 22:03:07 2017 PDT using RSA key ID
> > C7BC7E0D
> >> gpg: Good signature from "Maxime Beauchemin <
> > maximebeauche...@apache.org>"
> >> [unknown]
> >> gpg: WARNING: This key is not certified with a trusted signature!
> >> gpg:  There is no indication that the signature belongs to
> the
> >> owner.
> >> Primary key fingerprint: 99E2 0282 4A25 EAA3 7505  50BE E6F0 505C
> C7BC
> > 7E0D
> >>
> >> $ gpg --verify apache-airflow-1.8.2rc3+incubating.tar.gz.asc
> >> apache-airflow-1.8.2rc3+incubating.tar.gz
> >> gpg: Signature made Tue Aug  1 12:43:42 2017 PDT using RSA key ID
> > C7BC7E0D
> >> gpg: Good signature from "Maxime Beauchemin <
> > maximebeauche...@apache.org>"
> >> [unknown]
> >> gpg: WARNING: This key is not certified with a trusted signature!
> >> gpg:  There is no indication that the signature belongs to
> the
> >> owner.
> >> Primary key fingerprint: 99E2 0282 4A25 EAA3 7505  50BE E6F0 505C
>

Re: Airflow + MsSQL database hangs

2017-08-30 Thread Ruslan Dautkhanov
>> I tried to use NullPool (a change in Airflow) but creating over 1000
>> connections in minutes is too high a time overhead.

It might be a tuning problem with your backend MSSQL database.
Try decreasing sql_alchemy_pool_size, parallelism, and dag_concurrency to
fairly small values to see if you can still reproduce this issue.

Out of the box, SQL Server does not necessarily scale well (unlike Oracle,
for example).
SQL Server defaults to blocking selects on uncommitted data.
It seems from that stack trace that the hang happens in commits.
For highly concurrent workloads, row versioning-based isolation levels are
recommended as a starting point for the tuning process:
https://msdn.microsoft.com/en-us/library/ms175095.aspx
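
A hedged sketch of enabling those isolation levels on the Airflow metadata
database; the driver name, connection details, and database name are
assumptions, and ALTER DATABASE needs exclusive access to the database:

    import pyodbc

    # autocommit is required: ALTER DATABASE cannot run in a transaction.
    conn = pyodbc.connect(
        'DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;'
        'DATABASE=master;UID=airflow;PWD=secret',
        autocommit=True)
    cur = conn.cursor()
    cur.execute("ALTER DATABASE airflow SET ALLOW_SNAPSHOT_ISOLATION ON")
    cur.execute("ALTER DATABASE airflow SET READ_COMMITTED_SNAPSHOT ON")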



-- 
Ruslan Dautkhanov

On Wed, Aug 30, 2017 at 8:53 AM, Arkadiusz Kołodziejski <
arkadiusz.kolodziej...@gmail.com> wrote:

> Hello,
>
> I tried to use Airflow 1.8.2RC2 and 1.8.2.RC4 with MSSQL database.
> Unfortunately I always got Airflow scheduler hangs, always a few minutes
> after the 'airflow scheduler' process starts. I would like to share the
> results of my investigation of this problem.
>
>
>
> Here is an example stack trace from a hung process:
>
> Current thread 0x7f08faed0700 (most recent call first):
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/engine/default.py",
> line 440 in do_rollback
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/pool.py",
> line 829 in _reset
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/pool.py",
> line 687 in _finalize_fairy
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/pool.py",
> line 811 in _checkin
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/pool.py",
> line 960 in close
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/engine/base.py",
> line 859 in close
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/orm/session.py",
> line 542 in close
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/orm/session.py",
> line 473 in commit
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/lib/
> python3.5/site-packages/sqlalchemy/orm/session.py",
> line 906 in commit
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/jobs.py",
> line 161 in heartbeat
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/jobs.py",
> line 1454 in _execute_helper
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/jobs.py",
> line 1311 in _execute
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/jobs.py",
> line 201 in run
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/bin/cli.py",
> line 882 in scheduler
>
>   File
> "/home/administrator/code/wf/workflow-airflow/venv/src/
> apache-airflow/airflow/bin/airflow",
> line 28 in 
>
>   File "/home/administrator/code/wf/workflow-airflow/venv/bin/airflow",
> line 6 in 
>
>
>
> But sometimes it hangs on another cursor method.
>
>
>
> Facts:
>
>- Hangs on MSSQL DB and does not on POSTGRESQL
>- Hangs on remote DB connections; I do not observe this on local DB
>connections
>- Hangs with pymssql and pyodbc dialects
>- Hangs in 1.8.2RC2 and 1.8.2RC4
>- Hangs with SQLAlchemy Engine StaticPool, SingletonThreadPool and
>QueuePool
>- *Works with NullPool* ( new connection on get from pool)
>
>
>
> I tried to use NullPool (a change in Airflow) but creating over 1000
> connections in minutes is too high a time overhead.
>
>
>
> Has anyone faced this kind of problem with MSSQL DB?
>
>
> Thanks,
>
> Arek
>
>
> 
>  I am an Intel employee. All comments and opinions are my own and do not
> represent the views of Intel.
>


Re: Airflow + MsSQL database hangs

2017-08-30 Thread Ruslan Dautkhanov
*to start the tuning process )



-- 
Ruslan Dautkhanov

On Wed, Aug 30, 2017 at 3:10 PM, Ruslan Dautkhanov 
wrote:

> >> I tried to use NullPool (a change in Airflow) but creating over 1000
> >> connections in minutes is too high a time overhead.
>
> It might be a tuning problem with your backend MSSQL database.
> Try decreasing sql_alchemy_pool_size, parallelism, and dag_concurrency to
> fairly small values to see if you can still reproduce this issue.
>
> Out of the box, SQL Server does not necessarily scale well (unlike Oracle,
> for example).
> SQL Server defaults to blocking selects on uncommitted data.
> It seems from that stack trace that the hang happens in commits.
> For highly concurrent workloads, row versioning-based isolation levels are
> recommended as a starting point for the tuning process:
> https://msdn.microsoft.com/en-us/library/ms175095.aspx
>
>
>
> --
> Ruslan Dautkhanov
>
> On Wed, Aug 30, 2017 at 8:53 AM, Arkadiusz Kołodziejski <
> arkadiusz.kolodziej...@gmail.com> wrote:
>
>> Hello,
>>
>> I tried to use Airflow 1.8.2RC2 and 1.8.2.RC4 with MSSQL database.
>> Unfortunately I always got Airflow scheduler hangs, always a few minutes
>> after the 'airflow scheduler' process starts. I would like to share the
>> results of my investigation of this problem.
>>
>>
>>
>> Here is an example stack trace from a hung process:
>>
>> Current thread 0x7f08faed0700 (most recent call first):
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/engine/default.py",
>> line 440 in do_rollback
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/pool.py",
>> line 829 in _reset
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/pool.py",
>> line 687 in _finalize_fairy
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/pool.py",
>> line 811 in _checkin
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/pool.py",
>> line 960 in close
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/engine/base.py",
>> line 859 in close
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/orm/session.py",
>> line 542 in close
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/orm/session.py",
>> line 473 in commit
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/lib/pytho
>> n3.5/site-packages/sqlalchemy/orm/session.py",
>> line 906 in commit
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/jobs.py",
>> line 161 in heartbeat
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/jobs.py",
>> line 1454 in _execute_helper
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/jobs.py",
>> line 1311 in _execute
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/jobs.py",
>> line 201 in run
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/bin/cli.py",
>> line 882 in scheduler
>>
>>   File
>> "/home/administrator/code/wf/workflow-airflow/venv/src/apach
>> e-airflow/airflow/bin/airflow",
>> line 28 in 
>>
>>   File "/home/administrator/code/wf/workflow-airflow/venv/bin/airflow",
>> line 6 in 
>>
>>
>>
>> But sometimes it hangs on another cursor method.
>>
>>
>>
>> Facts:
>>
>>- Hangs on MSSQL DB and does not on POSTGRESQL
>>- Hangs on remote DB connections; I do not observe this on local DB
>>connections
>>- Hangs with pymssql and pyodbc dialects
>>- Hangs in 1.8.2RC2 and 1.8.2RC4
>>- Hangs with SQLAlchemy Engine StaticPool, SingletonThreadPool and
>>QueuePool
>>- *Works with NullPool* ( new connection on get from pool)
>>
>>
>>
>> I tried to use NullPool (a change in Airflow) but creating over 1000
>> connections in minutes is too high a time overhead.
>>
>>
>>
>> Has anyone faced this kind of problem with MSSQL DB?
>>
>>
>> Thanks,
>>
>> Arek
>>
>>
>> 
>>  I am an Intel employee. All comments and opinions are my own and do not
>> represent the views of Intel.
>>
>
>


Task log handler file.task does not support read logs. 'TaskInstance' object has no attribute 'task'

2017-09-18 Thread Ruslan Dautkhanov
Testing master..

Old workflow that worked just fine on 1.8.0 now complains

Task log handler file.task does not support read logs.
> 'TaskInstance' object has no attribute 'task'


Any ideas? Could it be related to the latest logging changes?


-- 
Ruslan Dautkhanov


Re: Task log handler file.task does not support read logs. 'TaskInstance' object has no attribute 'task'

2017-09-19 Thread Ruslan Dautkhanov
That happens in the UI when clicking on a task instance -> View Log.
It shows just this error. No exception stack, and nothing in the webserver
log either.
Thanks.


-- 
Ruslan Dautkhanov

On Tue, Sep 19, 2017 at 2:21 AM, Bolke de Bruin  wrote:

> Can you report the stack trace please?
>
> Cheers
> Bolke
>
> > On 19 Sep 2017, at 08:55, Ruslan Dautkhanov 
> wrote:
> >
> > Testing master..
> >
> > Old workflow that worked just fine on 1.8.0 now complains
> >
> > Task log handler file.task does not support read logs.
> >> 'TaskInstance' object has no attribute 'task'
> >
> >
> > Any ideas? It might be related to latest logging changes?
> >
> >
> > --
> > Ruslan Dautkhanov
>
>


Re: Proposal: Set Celery 4.0 as a minimum as Celery 4 is unsupported

2017-09-19 Thread Ruslan Dautkhanov
4.0 minimum
and 4.1 as recommended?
Celery 4.1 re-added sqla transport
https://issues.apache.org/jira/browse/AIRFLOW-797
https://issues.apache.org/jira/browse/AIRFLOW-749



-- 
Ruslan Dautkhanov

On Tue, Sep 19, 2017 at 9:54 AM, Alex Guziel  wrote:

> That's probably fine but I'd like to note two things.
>
> 1) The celery 3 config options are forwards compatible as far as I know
> 2) Still doesn't fix the bug where tasks get reserved even though it
> shouldn't.
>
>
> But I think it makes sense to upgrade the version in setup.py regardless.
>
> On Tue, Sep 19, 2017 at 2:39 AM Bolke de Bruin  wrote:
>
> > ping.
> >
> > I'd like some more feedback.
> >
> > Cheers
> > Bolke
> >
> > Verstuurd vanaf mijn iPad
> >
> > > Op 16 sep. 2017 om 17:45 heeft Ash Berlin-Taylor <
> > ash_airflowl...@firemirror.com> het volgende geschreven:
> > >
> > > +1 from us, we're running on Celery 4.0.2 in production on Airflow
> 1.8.2
> > (4.1 wasn't out when we started and haven't upgraded in prod yet)
> > >
> > >
> > >> On 16 Sep 2017, at 16:35, Bolke de Bruin  wrote:
> > >>
> > >> Hi,
> > >>
> > >> Some refactoring of the Celery config is underway and as some of the
> > options have changed between Celery 3 and 4 I have asked the question
> > whether Celery 3 is still supported. Apparently it is not (
> > https://github.com/celery/celery/issues/4258 <
> > https://github.com/celery/celery/issues/4258>).
> > >>
> > >> As Celery 3 is also 2 major releases behind I propose to set Celery
> 4.0
> > as a minimum supported version with Celery 4.1 being the recommended
> version.
> > I think this is also important as we do see some intermittent issues with
> > Celery that are reported to us but are, most likely, issues in Celery. I
> > don’t want to take care of those as they are difficult to debug.
> > >>
> > >> I would like to do this per 1.9.0. The 1.8.X branch can still support
> > Celery 3.
> > >>
> > >> Cheers
> > >> Bolke
> > >
> >
>


Re: Task log handler file.task does not support read logs. 'TaskInstance' object has no attribute 'task'

2017-09-20 Thread Ruslan Dautkhanov
I just tested with PR-2578 - yep it fixes.
Thank you.



-- 
Ruslan Dautkhanov

On Tue, Sep 19, 2017 at 11:32 AM, Bolke de Bruin  wrote:

> Should be fixed in master now.
>
> Sent from my iPhone
>
> > On 19 Sep 2017, at 16:52, Ruslan Dautkhanov 
> wrote:
> >
> > That happens in the UI when clicking on a task instance -> View Log.
> > It shows just this error. No exception stack, and nothing in the
> > webserver log either.
> > Thanks.
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> >> On Tue, Sep 19, 2017 at 2:21 AM, Bolke de Bruin 
> wrote:
> >>
> >> Can you report the stack trace please?
> >>
> >> Cheers
> >> Bolke
> >>
> >>> On 19 Sep 2017, at 08:55, Ruslan Dautkhanov 
> >> wrote:
> >>>
> >>> Testing master..
> >>>
> >>> Old workflow that worked just fine on 1.8.0 now complains
> >>>
> >>> Task log handler file.task does not support read logs.
> >>>> 'TaskInstance' object has no attribute 'task'
> >>>
> >>>
> >>> Any ideas? It might be related to latest logging changes?
> >>>
> >>>
> >>> --
> >>> Ruslan Dautkhanov
> >>
> >>
>


Re: Exception in worker when pickling DAGs

2017-10-23 Thread Ruslan Dautkhanov
I looked over cloudpickle that Alek mentioned. Looks cool - thanks for
referencing that.
https://github.com/cloudpipe/cloudpickle
It could be a drop-in replacement for pickle and a low-hanging fruit to fix
this issue.
I don't see an issue with storing cloudpickled objects in a database
table's blob field.
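
A minimal sketch of why it is close to a drop-in replacement: cloudpickle
serializes functions by value, and its output loads with plain pickle:

    import pickle

    import cloudpickle

    square = lambda x: x * x

    try:
        pickle.dumps(square)  # stdlib pickle serializes by reference
    except Exception as exc:
        print('stdlib pickle failed: %s' % exc)

    blob = cloudpickle.dumps(square)  # serialized by value
    restored = pickle.loads(blob)     # plain pickle can load the result
    print(restored(4))                # -> 16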
Thanks.


-- 
Ruslan Dautkhanov

On Thu, Oct 12, 2017 at 2:56 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> One issue that's been standing in the way is the fact that Jinja template
> objects are not pickleable. That and the fact that when people pass objects
> into their DAG objects (through params, callbacks or whatever other ways),
> the serialization can get tangled and pickles become gigantic. People
> typically don't understand the implications in that context.
>
> For now there's a workaround the Jinja template pickling issue that limits
> what you can do with Jinja (you'll notice that extends and imports just
> won't work in a pickle/remote setting).
>
> I remember spending a few hours on trying to pickle jinja templates in the
> first weeks of the project and ultimately giving up. I'm sure someone could
> get that working.
>
> Here's another related question, is the database a proper transport layer
> for pickles? It feels like a hack to me...
>
> Another idea that was discussed was to create a new BaseDagFetcher
> abstraction, along with a replacement for the current implementation:
> FilesystemDagFecher. Then people could write/use whatever other
> implementations like S3ZipDagFetcher, HDFSDagFetcher, PickleDagFetcher,
> 
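
A hypothetical sketch of that abstraction; none of these class names exist
in Airflow, they only illustrate the idea:

    import abc
    import os

    class BaseDagFetcher(object):
        """Abstract source of DAG definition files."""
        __metaclass__ = abc.ABCMeta

        @abc.abstractmethod
        def list_dag_files(self):
            """Return paths of DAG files to hand to the DagBag."""

    class FilesystemDagFetcher(BaseDagFetcher):
        """Mirrors today's behavior: scan a local dags folder."""

        def __init__(self, dags_folder):
            self.dags_folder = dags_folder

        def list_dag_files(self):
            return [os.path.join(root, f)
                    for root, _, files in os.walk(self.dags_folder)
                    for f in files if f.endswith('.py')]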
>
> Max
>
> On Thu, Oct 12, 2017 at 12:29 PM, Alek Storm  wrote:
>
> > That's disappointing. The promise of being able to deploy code just to
> the
> > Airflow master, and have that automatically propagated to workers, was a
> > major selling point for us when we chose Airflow over its alternatives -
> it
> > would greatly simplify our deploy tooling, and free us from having to
> worry
> > about DAG definitions getting out of sync between the master and workers.
> >
> > Perhaps the cloudpickle library, which came out of PySpark, could help
> > here: https://github.com/cloudpipe/cloudpickle. It appears to be
> > specifically designed for shipping Python code over a network.
> >
> > Alek
> >
> > On Thu, Oct 12, 2017 at 2:04 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > FYI: there's been talk of deprecating pickling altogether as it's very
> > > brittle.
> > >
> > > Max
> > >
> > > On Thu, Oct 12, 2017 at 10:45 AM, Alek Storm 
> > wrote:
> > >
> > > > Can anyone help with this? Has anyone successfully used Airflow with
> > > > pickling turned on that can give details on their setup?
> > > >
> > > > Thanks,
> > > > Alek
> > > >
> > > > On Mon, Oct 9, 2017 at 2:00 PM, Alek Storm 
> > wrote:
> > > >
> > > > > Yes, everything's correctly imported - everything works fine when I
> > run
> > > > > the scheduler without pickling turned on.
> > > > >
> > > > > Thanks,
> > > > > Alek
> > > > >
> > > > > On Mon, Oct 9, 2017 at 1:19 PM, Edgar Rodriguez <
> > > > > edgar.rodrig...@airbnb.com.invalid> wrote:
> > > > >
> > > > >> The relevant part seems to be:
> > > > >>
> > > > >> ImportError: No module named
> > > > >> unusual_prefix_9b311bfeb8bf0fca09b0857b2b60fb
> > a16effe386_fetch_orgmast
> > > > >> [2017-10-07 13:18:22,155: ERROR/ForkPoolWorker-5] Command 'airflow
> > run
> > > > >> fetch_orgmast latest_only 2017-10-07T13:18:07.489500 --pickle 599
> > > > >> --local -sd /site/conf/airflow/dags/data/fetch_orgmast.py'
> returned
> > > > >> non-zero exit status 1
> > > > >>
> > > > >> Did you check that your task and script `fetch_orgmast.py` are
> > > correctly
> > > > >> importing all modules that they use?
> > > > >>
> > > > >> Cheers,
> > > > >> Edgar
> > > > >>
> > > > >> On Sat, Oct 7, 2017 at 11:25 AM, Alek Storm  >
> > > > wrote:
> > > > >>
> > > > >> > Hi all,
> > > > >> >
> > > > >> > When running the scheduler as airflow scheduler -p and the
> worke

Re: Exception in worker when pickling DAGs

2017-10-23 Thread Ruslan Dautkhanov
Thanks Bolke. Good points on pickling vs an external API.
Here's a bit on cloudpickle performance comparison
https://github.com/cloudpipe/cloudpickle/issues/58
https://github.com/cloudpipe/cloudpickle/issues/44
https://github.com/RaRe-Technologies/gensim/issues/558
DAGs normally aren't large enough to be much concerned with
memory/performance? (well, at least in the cases we work with)


Best regards,
Ruslan

On Mon, Oct 23, 2017 at 11:59 AM, Bolke de Bruin  wrote:

> But cloudpickle looks promising. What’s the speed, mem requirements?
>
> Bolke
>
> Verstuurd vanaf mijn iPad
>
> > Op 23 okt. 2017 om 19:57 heeft Bolke de Bruin  het
> volgende geschreven:
> >
> > The other option is to use zipped dags and have those picked up by the
> workers from the api. This is less error prone than pickling (marshmallow,
> cloudpickle). I have a working prototype for this, but it needs to be
> updated to the current Airflow.
> >
> > Another option is to use copyreg for fields that are difficult to
> serialize and apply that to the jinja2 template fields. This allows one to
> pickle.
> >
> > But I think we should deprecate pickling all together and move over to
> something external/api wise.
> >
> > Cheers
> > Bolke
> >
> > Verstuurd vanaf mijn iPad
> >
> >> Op 23 okt. 2017 om 19:35 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >>
> >> I looked over cloudpickle that Alek mentioned. Looks cool - thanks for
> >> referencing that.
> >> https://github.com/cloudpipe/cloudpickle
> >> It could be a drop-in replacement for pickle and a low-hanging fruit to
> fix
> >> this issue.
> >> I don't see this to be an issue storing cloudpickled objects in a
> database
> >> table's blob field.
> >> Thanks.
> >>
> >>
> >> --
> >> Ruslan Dautkhanov
> >>
> >> On Thu, Oct 12, 2017 at 2:56 PM, Maxime Beauchemin <
> >> maximebeauche...@gmail.com> wrote:
> >>
> >>> One issue that's been standing in the way is the fact that Jinja
> template
> >>> objects are not pickleable. That and the fact that when people pass
> objects
> >>> into their DAG objects (through params, callbacks or whatever other
> ways),
> >>> the serialization can get tangled and pickles become gigantic. People
> >>> typically don't understand the implications in that context.
> >>>
> >>> For now there's a workaround the Jinja template pickling issue that
> limits
> >>> what you can do with Jinja (you'll notice that extends and imports just
> >>> won't work in a pickle/remote setting).
> >>>
> >>> I remember spending a few hours on trying to pickle jinja templates in
> the
> >>> first weeks of the project and ultimately giving up. I'm sure someone
> could
> >>> get that working.
> >>>
> >>> Here's another related question, is the database a proper transport
> layer
> >>> for pickles? It feels like a hack to me...
> >>>
> >>> Another idea that was discussed was to create a new BaseDagFetcher
> >>> abstraction, along with a replacement for the current implementation:
> >>> FilesystemDagFecher. Then people could write/use whatever other
> >>> implementations like S3ZipDagFetcher, HDFSDagFetcher, PickleDagFetcher,
> >>> 
> >>>
> >>> Max
> >>>
> >>>> On Thu, Oct 12, 2017 at 12:29 PM, Alek Storm 
> wrote:
> >>>>
> >>>> That's disappointing. The promise of being able to deploy code just to
> >>> the
> >>>> Airflow master, and have that automatically propagated to workers,
> was a
> >>>> major selling point for us when we chose Airflow over its
> alternatives -
> >>> it
> >>>> would greatly simplify our deploy tooling, and free us from having to
> >>> worry
> >>>> about DAG definitions getting out of sync between the master and
> workers.
> >>>>
> >>>> Perhaps the cloudpickle library, which came out of PySpark, could help
> >>>> here: https://github.com/cloudpipe/cloudpickle. It appears to be
> >>>> specifically designed for shipping Python code over a network.
> >>>>
> >>>> Alek
> >>>>
> >>>> On Thu, Oct 12, 2017 at 2:04 PM, Maxime Beauchemin <
> >>>> maximebeauche...@gmail.com> wrote:
> >>>>
> >>>>> FYI

Re: [VOTE] Airflow 1.9.0rc1

2017-11-08 Thread Ruslan Dautkhanov
+1



-- 
Ruslan Dautkhanov

On Wed, Nov 8, 2017 at 10:46 AM, Chris Riccomini 
wrote:

> Anyone? :/
>
> On Mon, Nov 6, 2017 at 1:22 PM, Chris Riccomini 
> wrote:
>
> > Hey all,
> >
> > I have cut Airflow 1.9.0 RC1. This email is calling a vote on the
> > release, which will last for 72 hours. Consider this my (binding) +1.
> >
> > Airflow 1.9.0 RC1 is available at:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/airflow/1.9.0rc1/
> >
> > apache-airflow-1.9.0rc1+incubating-source.tar.gz is a source release
> that
> > comes with INSTALL instructions.
> > apache-airflow-1.9.0rc1+incubating-bin.tar.gz is the binary Python
> > "sdist" release.
> >
> > Public keys are available at:
> >
> > https://dist.apache.org/repos/dist/release/incubator/airflow/
> >
> > The release contains the following JIRAs:
> >
> > ISSUE ID    |DESCRIPTION                                        |PR    |COMMIT
> > AIRFLOW-1779|Add keepalive packets to ssh hook                  |#2749 |d2f9d1
> > AIRFLOW-1776|stdout/stderr logging not captured                 |#2745 |590d9f
> > AIRFLOW-1771|Change heartbeat text from boom to heartbeat       |-     |-
> > AIRFLOW-1767|Airflow Scheduler no longer schedules DAGs         |-     |-
> > AIRFLOW-1765|Default API auth backed should deny all.           |#2737 |6ecdac
> > AIRFLOW-1764|Web Interface should not use experimental api      |#2738 |6bed1d
> > AIRFLOW-1757|Contrib.SparkSubmitOperator should allow --package |#2725 |4e06ee
> > AIRFLOW-1745|BashOperator ignores SIGPIPE in subprocess         |#2714 |e021c9
> > AIRFLOW-1744|task.retries can be False                          |#2713 |6144c6
> > AIRFLOW-1743|Default config template should not contain ldap fi |#2712 |270684
> > AIRFLOW-1741|Task Duration shows two charts on first page load. |#2711 |974b49
> > AIRFLOW-1734|Sqoop Operator contains logic errors & needs optio |#2703 |f6810c
> > AIRFLOW-1731|Import custom config on PYTHONPATH                 |#2721 |f07eb3
> > AIRFLOW-1726|Copy Expert command for Postgres Hook              |#2698 |8a4ad3
> > AIRFLOW-1719|Fix small typo - your vs you                       |-     |-
> > AIRFLOW-1712|Log SSHOperator output                             |-     |-
> > AIRFLOW-1711|Ldap Attributes not always a "list" part 2         |#2731 |40a936
> > AIRFLOW-1706|Scheduler is failed on startup with MS SQL Server  |#2733 |9e209b
> > AIRFLOW-1698|Remove confusing SCHEDULER_RUNS env var from syste |#2677 |00dd06
> > AIRFLOW-1695|Redshift Hook using boto3 & AWS Hook               |#2717 |bfddae
> > AIRFLOW-1694|Hive Hooks: Python 3 does not have an `itertools.i |#2674 |c6e5ae
> > AIRFLOW-1692|Master cannot be checked out on windows            |#2673 |31805e
> > AIRFLOW-1691|Add better documentation for Google cloud storage  |#2671 |ace2b1
> > AIRFLOW-1690|Error messages regarding gcs log commits are spars |#2670 |5fb5cd
> > AIRFLOW-1682|S3 task handler never writes to S3                 |#2664 |0080f0
> > AIRFLOW-1678|Fix docstring errors for `set_upstream` and `set_d |-     |-
> > AIRFLOW-1677|Fix typo in example_qubole_operator                |-     |-
> > AIRFLOW-1676|GCS task handler never writes to GCS               |#2659 |781fa4
> > AIRFLOW-1675|Fix API docstrings to be properly rendered         |#2667 |f12381
> > AIRFLOW-1671|Missing @apply_defaults annotation for gcs downloa |#2655 |97666b
> > AIRFLOW-1669|Fix Docker import in Master                        |#na   |f7f2a8
> > AIRFLOW-1668|Redhsift requires a keep alive of < 300s           |#2650 |f2bb77
> > AIRFLOW-1664|Make MySqlToGoogleCloudStorageOperator support bin |#2649 |95813d
> > AIRFLOW-1660|Change webpage width to full-width                 |#2646 |8ee3d9
> > AIRFLOW-1659|Fix invalid attribute bug in FileTaskHandler       |#2645 |bee823
> > AIRFLOW-1658|Kill (possibly) still running Druid indexing job a |#2644 |cbf7ad
> > AIRFLOW-1657|Handle failure of Qubole Operator for s3distcp had |-     |-
> > AIRFLOW-1654|Show tooltips for link icons in DAGs view          |#2642 |ada7b2
> > AIRFLOW-1647|Fix Spark-sql hook                                 |#2637 |b1e5c6
> > AIRFLOW-1641|Task gets stuck in queued state                    |#2715 |735497
> > AIRFLOW-1640|Ad

Re: RBAC Update

2017-11-17 Thread Ruslan Dautkhanov
That's awesome.

1. would it be possible to map an LDAP group, for example, to view-level
access roles?
2. screenshots would be nice
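
For reference, Flask-AppBuilder's LDAP auth is configured roughly like
this (key names follow FAB's security docs; the values are assumptions,
and mapping LDAP groups to specific roles would likely need a custom
SecurityManager):

    from flask_appbuilder.security.manager import AUTH_LDAP

    AUTH_TYPE = AUTH_LDAP
    AUTH_LDAP_SERVER = 'ldap://ldap.example.com'
    AUTH_USER_REGISTRATION = True           # create users on first login
    AUTH_USER_REGISTRATION_ROLE = 'Viewer'  # default role for new users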

Thank you.



-- 
Ruslan Dautkhanov

On Fri, Nov 17, 2017 at 2:44 PM, Joy Gao  wrote:

> Hi guys.
>
> I've been working on moving airflow from Flask-Admin to Flask-AppBuilder
> for RBAC
> <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+RBAC+proposal
> >,
> check it out at https://github.com/wepay/airflow-webserver.
>
> It's still a work-in-progress, but most features you see in the webserver
> UI today are available there. For those who are interested in RBAC, I'd love
> to get some early feedback in terms of the following:
>
> - New Flask-AppBuilder UI (any bugs/regressions)
> - Setup issues
> - Ease of integration with third party auth (i.e. LDAP, AD, OAuth, etc.)
> - Any other thoughts/concerns
>
> Thanks a lot!
>
> Cheers,
> Joy
>


Re: Data lineage and data portal

2017-11-27 Thread Ruslan Dautkhanov
+1

Thank you


On Mon, Nov 27, 2017 at 12:38 PM Gerard Toonstra 
wrote:

> Hi all,
>
> So something that really drew my attention recently is a "data portal"  as
> described by a team from airbnb somewhere in May. The idea is basically a
> "facebook of data":
>
>
>
> https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
>
>
> Unfortunately it looks like it's not going to be opensourced due to how
> heavily integrated it is with their specific infrastructure; but the idea
> itself to me sounds like it's something every organization of a certain
> size should have to keep track of data and stay informed as an
> organization.
>
> Based on the descriptions, I prototyped some things away and am happy with
> the results and the speed that something like this can be constructed. I'm
> now working on sql scanners, extractors and other tools that allow me to
> populate the database and put a poc together on some real data.
>
> If other people have similar concerns in their organization and think this
> would be a great thing to have, reply to me or the list; with sufficient
> interest I may set up a web chat/meet session so this can be discussed in
> more detail and find ways to progress this.
>
>
> Best regards,
>
> Gerard
>


Re: Data lineage and data portal

2017-11-27 Thread Ruslan Dautkhanov
‘’’
I'm
now working on sql scanners, extractors and other tools that allow me to
populate the database
‘’’

Very cool. Cloudera Navigator (not an open source product) does this too
to some extent - it collects metadata and creates data lineage automatically
(stored as a Solr collection) by parsing SQL queries.

https://www.cloudera.com/documentation/enterprise/5-12-x/topics/datamgmt_extraction_indexing.html



On Mon, Nov 27, 2017 at 12:38 PM Gerard Toonstra 
wrote:

> Hi all,
>
> So something that really drew my attention recently is a "data portal"  as
> described by a team from airbnb somewhere in May. The idea is basically a
> "facebook of data":
>
>
>
> https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
>
>
> Unfortunately it looks like it's not going to be opensourced due to how
> heavily integrated it is with their specific infrastructure; but the idea
> itself to me sounds like it's something every organization of a certain
> size should have to keep track of data and stay informed as an
> organization.
>
> Based on the descriptions, I prototyped some things away and am happy with
> the results and the speed that something like this can be constructed. I'm
> now working on sql scanners, extractors and other tools that allow me to
> populate the database and put a poc together on some real data.
>
> If other people have similar concerns in their organization and think this
> would be a great thing to have, reply to me or the list; with sufficient
> interest I may set up a web chat/meet session so this can be discussed in
> more detail and find ways to progress this.
>
>
> Best regards,
>
> Gerard
>


tabulate 0.82

2018-01-16 Thread Ruslan Dautkhanov
We have tabulate 0.8.2, but the requirements demand tabulate<0.8.0,>=0.7.5.

Are there known issues with tabulate versions higher than 0.8.0?


$ airflow kerberos -D
>


> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/bin/airflow", line 4, in 
>
> __import__('pkg_resources').require('apache-airflow==1.10.0.dev0+incubating')
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 2985, in 
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 2971, in _call_aside
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 2998, in _initialize_master_working_set
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 662, in _build_master
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 675, in _build_from_requirements
>   File
> "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py",
> line 854, in resolve
> pkg_resources.DistributionNotFound: The 'tabulate<0.8.0,>=0.7.5'
> distribution was not found and is required by apache-airflow



-- 
Ruslan Dautkhanov


AttributeError: 'NoneType' object has no attribute 'tzinfo'

2018-01-16 Thread Ruslan Dautkhanov
Upgraded Airflow .. getting following error [1] when processing a DAG

We have 'start_date': None set in default_args.. but this used to work in
previous Airflow versions.
This is an '@once' DAG, so we don't need a start_date (no backfill).



[1]

[2018-01-16 16:05:25,283] {models.py:293} ERROR - Failed to import:
> /home/rdautkha/airflow/dags/discover/discover-ora-load-2.py
> Traceback (most recent call last):
>   File
> "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py", line
> 290, in process_file
> m = imp.load_source(mod_name, filepath)
>   File "/home/rdautkha/airflow/dags/discover/discover-ora-load-2.py", line
> 66, in 
> orientation= 'TB',  #
> default graph view
>   File
> "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py", line
> 2951, in __init__
> self.timezone = self.default_args['start_date'].tzinfo
> AttributeError: 'NoneType' object has no attribute 'tzinfo'



-- 
Ruslan Dautkhanov


Re: AttributeError: 'NoneType' object has no attribute 'tzinfo'

2018-01-16 Thread Ruslan Dautkhanov
Fix:

vi "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py"
:2951
changed
    self.timezone = self.default_args['start_date'].tzinfo
to
    if self.default_args['start_date']:
        self.timezone = self.default_args['start_date'].tzinfo
    else:
        self.timezone = settings.TIMEZONE

Let me know if a PR makes sense to you.. at least it fixes things for us.



-- 
Ruslan Dautkhanov

On Tue, Jan 16, 2018 at 4:10 PM, Ruslan Dautkhanov 
wrote:

> Upgraded Airflow .. getting following error [1] when processing a DAG
>
> We have 'start_date': None set in default_args.. but this used to work in
> previous Airflow versions.
> This is an '@once' DAG, so we don't need a start_date (no backfill).
>
>
>
> [1]
>
> [2018-01-16 16:05:25,283] {models.py:293} ERROR - Failed to import:
>> /home/rdautkha/airflow/dags/discover/discover-ora-load-2.py
>> Traceback (most recent call last):
>>   File "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py",
>> line 290, in process_file
>> m = imp.load_source(mod_name, filepath)
>>   File "/home/rdautkha/airflow/dags/discover/discover-ora-load-2.py",
>> line 66, in 
>> orientation= 'TB',  #
>> default graph view
>>   File "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py",
>> line 2951, in __init__
>> self.timezone = self.default_args['start_date'].tzinfo
>> AttributeError: 'NoneType' object has no attribute 'tzinfo'
>
>
>
> --
> Ruslan Dautkhanov
>


Re: Airflow tshirts

2018-01-26 Thread Ruslan Dautkhanov
I would love to get one

For some reason the reservation is broken for me.

I was trying to reserve one - getting a spinning wheel after I clicked
"Make Reservation".

The browser shows an uncaught JS exception; not sure if that's the issue:

Uncaught TypeError: Stripe.createToken is not a function
> at e.handleStripePayment (application-06bfcf3dd8133f23390d8e60deb2a0
> 85ec9b0b9f0393c47f99e5e9d8b69eea92.js:145)
> at HTMLInputElement. (application-
> 06bfcf3dd8133f23390d8e60deb2a085ec9b0b9f0393c47f99e5e9d8b69eea92.js:145)
> at HTMLFormElement.dispatch (application-
> 06bfcf3dd8133f23390d8e60deb2a085ec9b0b9f0393c47f99e5e9d8b69eea92.js:30)
> at HTMLFormElement.y.handle (application-
> 06bfcf3dd8133f23390d8e60deb2a085ec9b0b9f0393c47f99e5e9d8b69eea92.js:30)



Since this is a dev list, I thought I'd bring this up.. it doesn't look
like an issue with Teespring's Airflow though :-)

Was anyone able to get/reserve one?



On Fri, Jan 26, 2018 at 12:52 PM, Jakob Homan  wrote:

> Be sure to run the new design past Brand Management too
> (https://www.apache.org/foundation/marks/ - Using Apache trademarks on
> merchandise).
>
> Thanks for doing this Maxime, it's awesome.
>
> -jg
>
> On 26 January 2018 at 11:06, Maxime Beauchemin
>  wrote:
> > Ooops. Apparently the link didn't work either. I'll do a second try soon
> > and add a feather!
> >
> > Max
> >
> > On Wed, Jan 24, 2018 at 11:45 PM, Sid Anand  wrote:
> >
> >> I couldn't agree more..
> >>
> >> add the feather :_)
> >>
> >> -s
> >>
> >> On Wed, Jan 24, 2018 at 5:49 PM, Jakob Homan  wrote:
> >>
> >> > 
> >> > *Apache* Airflow...
> >> > 
> >> >
> >> > ...
> >> >
> >> > -jg
> >> >
> >> >
> >> > On 24 January 2018 at 14:42, Trent Robbins 
> wrote:
> >> > > Thanks Max!
> >> > >
> >> > > Trent
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > Best,
> >> > >
> >> > > Trent Robbins
> >> > > Strategic Consultant for Open Source Software
> >> > > Tau Informatics LLC
> >> > > desk: 415-404-9452
> >> > > cell: 513-233-5651
> >> > > tr...@tauinformatics.com
> >> > > https://www.linkedin.com/in/trentrobbins
> >> > >
> >> > > On Wed, Jan 24, 2018 at 1:59 PM, Maxime Beauchemin <
> >> > > maximebeauche...@gmail.com> wrote:
> >> > >
> >> > >> After a recent Meetup where many folks asked about where they could
> >> get
> >> > an
> >> > >> Airflow tshirt like the one I was wearing. Now I'm trying to get a
> >> > batch of
> >> > >> Airflow tshirts kickstarted on Teespring.
> >> > >>
> >> > >> BTW Teespring uses Airflow, making it the obvious choice for this
> :)
> >> > >>
> >> > >> Order yours here:
> >> > >> https://teespring.com/teespringdirect-nl2qbp#pid=72&;
> >> cid=5902&sid=front
> >> > >>
> >> > >> Profits (if any) will be donated to charity.
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Max
> >> > >>
> >> >
> >>
>


Re: AttributeError: 'NoneType' object has no attribute 'tzinfo'

2018-04-22 Thread Ruslan Dautkhanov
Created jira and a PR

https://issues.apache.org/jira/browse/AIRFLOW-2351
https://github.com/apache/incubator-airflow/pull/3254

Please help review.




-- 
Ruslan Dautkhanov

On Tue, Jan 16, 2018 at 11:50 PM, Bolke de Bruin  wrote:

> Good point, but you have upgraded to master. Can you provide a PR?
>
> Verstuurd vanaf mijn iPad
>
> > Op 17 jan. 2018 om 00:10 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >
> > Upgraded Airflow .. getting following error [1] when processing a DAG
> >
> > We have 'start_date': None set in default_args.. but this used to work in
> > previous Airflow versions.
> > This is an '@once' DAG, so we don't need a start_date (no backfill).
> >
> >
> >
> > [1]
> >
> > [2018-01-16 16:05:25,283] {models.py:293} ERROR - Failed to import:
> >> /home/rdautkha/airflow/dags/discover/discover-ora-load-2.py
> >> Traceback (most recent call last):
> >>  File
> >> "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py",
> line
> >> 290, in process_file
> >>m = imp.load_source(mod_name, filepath)
> >>  File "/home/rdautkha/airflow/dags/discover/discover-ora-load-2.py",
> line
> >> 66, in 
> >>orientation= 'TB',  #
> >> default graph view
> >>  File
> >> "/opt/airflow/airflow-20180116/src/apache-airflow/airflow/models.py",
> line
> >> 2951, in __init__
> >>self.timezone = self.default_args['start_date'].tzinfo
> >> AttributeError: 'NoneType' object has no attribute 'tzinfo'
> >
> >
> >
> > --
> > Ruslan Dautkhanov
>


k8s example DAGs

2018-04-22 Thread Ruslan Dautkhanov
Is it possible to make the Kubernetes example DAGs optional?

We don't use Kubernetes, and a bare Airflow install fills the logs with the
following:

2018-04-22 19:49:04,718 ERROR - Failed to import:
> /opt/airflow/airflow-20180420/src/apache-airflow/airflow/
> example_dags/example_kubernetes_operator.py
> Traceback (most recent call last):
>   File "/opt/airflow/airflow-20180420/src/apache-airflow/airflow/models.py",
> line 300, in process_file
> m = imp.load_source(mod_name, filepath)
>   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> airflow/example_dags/example_kubernetes_operator.py", line 19, in 
> from airflow.contrib.operators.kubernetes_pod_operator import
> KubernetesPodOperator
>   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> airflow/contrib/operators/kubernetes_pod_operator.py", line 21, in
> 
> from airflow.contrib.kubernetes import kube_client, pod_generator,
> pod_launcher
>   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> airflow/contrib/kubernetes/pod_launcher.py", line 25, in 
> from kubernetes import watch
> ImportError: No module named kubernetes


It would be great to load examples based on which modules are installed,
for examples that have external dependencies - a sketch of one way to do
that is below.
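
For illustration, a hypothetical guard an example DAG could use (not the
actual example_dags code):

    try:
        from airflow.contrib.operators.kubernetes_pod_operator import (
            KubernetesPodOperator)
    except ImportError:
        KubernetesPodOperator = None  # kubernetes client not installed

    if KubernetesPodOperator is not None:
        # Define the example DAG only when the dependency is available;
        # environments without the kubernetes module then skip it quietly
        # instead of logging ImportError tracebacks.
        pass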


Thanks!

Ruslan Dautkhanov


Re: k8s example DAGs

2018-04-22 Thread Ruslan Dautkhanov
Hi Fokko,

Sure, https://issues.apache.org/jira/browse/AIRFLOW-2358

Thanks!



-- 
Ruslan Dautkhanov

On Mon, Apr 23, 2018 at 12:41 AM, Driesprong, Fokko 
wrote:

> Hi Ruslan,
>
> This is a good point. I also get No module named kubernetes exceptions when
> running the initdb. This should be fixed. Could you create a Jira ticket
> for this?
>
> Cheers, Fokko
>
> 2018-04-23 4:28 GMT+02:00 Ruslan Dautkhanov :
>
> >  Is it possible to make the Kubernetes example DAGs optional?
> >
> > We don't use Kubernetes, and a bare Airflow install fills the logs with
> > the following:
> >
> > 2018-04-22 19:49:04,718 ERROR - Failed to import:
> > > /opt/airflow/airflow-20180420/src/apache-airflow/airflow/
> > > example_dags/example_kubernetes_operator.py
> > > Traceback (most recent call last):
> > >   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> > airflow/models.py",
> > > line 300, in process_file
> > > m = imp.load_source(mod_name, filepath)
> > >   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> > > airflow/example_dags/example_kubernetes_operator.py", line 19, in
> > 
> > > from airflow.contrib.operators.kubernetes_pod_operator import
> > > KubernetesPodOperator
> > >   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> > > airflow/contrib/operators/kubernetes_pod_operator.py", line 21, in
> > > 
> > > from airflow.contrib.kubernetes import kube_client, pod_generator,
> > > pod_launcher
> > >   File "/opt/airflow/airflow-20180420/src/apache-airflow/
> > > airflow/contrib/kubernetes/pod_launcher.py", line 25, in 
> > > from kubernetes import watch
> > > ImportError: No module named kubernetes
> >
> >
> > It would be great to load examples based on which modules are installed,
> > for examples that have external dependencies.
> >
> >
> > Thanks!
> >
> > Ruslan Dautkhanov
> >
>


Airflow - YARN as an executor?

2018-04-24 Thread Ruslan Dautkhanov
With Hadoop 3's Docker-on-YARN support, I think YARN becomes
somewhat of a competitor to Kubernetes.

Great job on adding k8s support to Airflow.

Very similarly, I can see Airflow integrating with YARN and using
its infrastructure as an "executor".. has anyone explored the feasibility of
this approach?


Thanks!
Ruslan Dautkhanov


Re: Airflow - YARN as an executor?

2018-04-24 Thread Ruslan Dautkhanov
Kubernetes is a "monolithic" 1-level scheduler that can't handle what YARN
can - for example, scheduling tasks local to data.
Hadoop has multiple levels of data locality (node-local, rack-local), so
computation happens local to data to minimize network
data transfer, which is expensive.
K8s wasn't designed to handle these scheduling scenarios, as far as I know.

For cloud deployments where we don't have the data locality problem (because
S3 is used instead of storage local
to the servers), k8s might be okay.

Nice comparison [1] of k8s vs two-level schedulers like YARN and Mesos,
although I think it's off-topic.

We're mostly on-prem and we don't see Kubernetes taking over YARN any time
soon.

Thanks.



[1]

https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1

*2.3.2 Monolithic Schedulers *



Monolithic schedulers use a single, centralized scheduling algorithm for
all jobs. All workload is run through the same scheduler and same
scheduling logic. Swarm,
Fleet, Borg and Kubernetes adopt monolithic schedulers. Kubernetes
improvised on basic monolithic version of Borg and Swarm schedulers. This
type of schedulers are not suitable for running heterogeneous modern
workloads which include Spark jobs, containers, and other long running jobs,
etc.



*2.3.3 Two Level Schedulers *



Two-level schedulers address the drawbacks of a monolithic scheduler by
separating concerns of resource allocation and task placement. An active
resource manager offers compute resources to multiple parallel, independent
“scheduler frameworks”. The Mesos cluster manager pioneered this approach,
and YARN supports a limited version of it. In Mesos, resources are offered
to application-level schedulers. This allows for custom, workload-specific
scheduling policies. The drawback with this type of scheduling architecture
is that the application level frameworks cannot see all the possible
placement options anymore. Instead, they only see those options that
correspond to resources offered (Mesos) or allocated (YARN) by the resource
manager component. This makes priority preemption (higher priority tasks
kick out lower priority ones) difficult.





-- 
Ruslan Dautkhanov

On Tue, Apr 24, 2018 at 2:22 PM, Bolke de Bruin  wrote:

> Happy to have it as a contrib executor. However, I personally think yarn
> is a dead end. It has a lot of catching up to do and all the momentum is
> with kubernetes.
>
> B.
>
> Verstuurd vanaf mijn iPad
>
> > Op 24 apr. 2018 om 22:13 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >
> > With Hadoop 3's Docker on YARN support, I think YARN becomes
> > somewhat a competitor for Kubernetes.
> >
> > Great job on adding k8s support to Airflow.
> >
> > Very similarly I see Airflow could integrate with YARN and use
> > its infrastructure as an "executor" .. have anyone explored feasibility
> of
> > this approach?
> >
> >
> > Thanks!
> > Ruslan Dautkhanov
>


Re: Airflow - YARN as an executor?

2018-04-25 Thread Ruslan Dautkhanov
Now I think an Airflow PySpark executor would be an easier target.
Spark runs on YARN, Mesos and now Kubernetes.
So a PySpark executor would effectively port Airflow to all of these schedulers.
It's my understanding we currently have only a Spark operator and not an executor.

Thanks!



-- 
Ruslan Dautkhanov

On Tue, Apr 24, 2018 at 3:20 PM, Ace Haidrey  wrote:

> Hey I didn’t know this Bolke, I was under the impression of the same as
> Ruslan.
> Thanks for the share
>
> Sent from my iPhone
>
> > On Apr 24, 2018, at 2:12 PM, Bolke de Bruin  wrote:
> >
> > It actually can nowadays: https://cdn.oreillystatic.com/
> en/assets/1/event/269/HDFS%20on%20Kubernetes_%20Tech%
> 20deep%20dive%20on%20locality%20and%20security%20Presentation.pptx
> >
> > We also have an on premise setup with ceph (s3a) and HDFS for when we
> need the speed and kubernetes for our workloads. We are kicking out Yarn
> (and hive etc for that matter).
> >
> > Bolke
> >
> >
> >
> > Verstuurd vanaf mijn iPad
> >
> >> Op 24 apr. 2018 om 22:50 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >>
> >> Kubernetes is a "monolithic" 1-level scheduler that can't handle what
> >> YARN can - for example, schedule tasks local to data.
> >> Hadoop has multiple levels of data locality (node-local, rack-local) -
> >> so computation happens local to the data, to minimize expensive
> >> network data transfer.
> >> K8s wasn't designed to handle these scheduling scenarios, as far as I
> >> know.
> >>
> >> For cloud deployments where we don't have the data locality problem
> >> (because s3 is used instead of storage local to the servers),
> >> k8s might be okay.
> >>
> >> Nice comparison [1] of k8s vs two-level schedulers like YARN and
> >> Mesos, although I think it's off-topic.
> >>
> >> We're mostly on-prem and we don't see kubernetes taking over yarn any
> >> time soon.
> >>
> >> Thanks.
> >>
> >>
> >>
> >> [1]
> >>
> >> https://aaltodoc.aalto.fi/bitstream/handle/123456789/27061/master_Ravula_Shashi_2017.pdf?sequence=1
> >>


Re: Airflow - YARN as an executor?

2018-04-25 Thread Ruslan Dautkhanov
I used "Executor" as an Airflow term - I didn't mean a Spark executor.
Spark would be one of the Executors
in here
https://github.com/apache/incubator-airflow/tree/master/airflow/executors
or in here
https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors

Thanks.
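
For anyone unfamiliar with the term: an Airflow Executor is a class
implementing the BaseExecutor interface, which the scheduler uses to run
queued task commands. A minimal, hypothetical sketch of a Spark-backed one
(the class name and all Spark wiring are illustrative; only the
start/execute_async/sync/end hook points come from Airflow's BaseExecutor,
and their exact signatures vary by Airflow version):

import threading

from airflow.executors.base_executor import BaseExecutor

class SparkBackedExecutor(BaseExecutor):  # hypothetical, does not exist

    def start(self):
        # One long-lived Spark application acts as the worker pool.
        from pyspark.sql import SparkSession
        self.spark = SparkSession.builder.appName("airflow-executor").getOrCreate()
        self.results = {}

    def execute_async(self, key, command, queue=None, executor_config=None):
        # Ship the "airflow run ..." command string as a one-element Spark job.
        rdd = self.spark.sparkContext.parallelize([command], numSlices=1)

        def run_command(cmd):
            import subprocess  # imported on the Spark worker
            return subprocess.call(cmd, shell=True)

        def wait_for_result():
            self.results[key] = rdd.map(run_command).collect()[0]

        threading.Thread(target=wait_for_result, daemon=True).start()

    def sync(self):
        # Report finished tasks back to the scheduler on each heartbeat.
        for key, rc in list(self.results.items()):
            self.success(key) if rc == 0 else self.fail(key)
            del self.results[key]

    def end(self):
        self.spark.stop()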



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin  wrote:

> I'm a bit lost on the spark executor, to be honest. To my knowledge the
> spark driver creates spark executors which run spark code. In other words
> it can’t arbitrarily run generic code. Or can it?
>
> B.
>
> Verstuurd vanaf mijn iPad
>
> > Op 25 apr. 2018 om 17:11 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >
> > Now I think an Airflow PySpark Executor would be an easier target.
> > Spark runs on YARN, Mesos and now Kubernetes,
> > so a PySpark Executor would effectively port Airflow to all of these
> > schedulers.
> > It's my understanding we currently have only a Spark Operator, not an
> > Executor.
> >
> > Thanks!
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >

Re: Airflow - YARN as an executor?

2018-04-25 Thread Ruslan Dautkhanov
As long as that code is serializable (through pickle, cloudpickle or any
other Python code serializer), the answer should be yes.

Thanks.
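
For illustration: cloudpickle, unlike the standard pickle module, can
serialize functions defined at runtime, including closures - which is
exactly what shipping arbitrary task code to remote workers requires. A
small self-contained example:

import cloudpickle

def make_task(threshold):
    # a closure capturing local state; plain pickle cannot serialize this
    def task(values):
        return [v for v in values if v > threshold]
    return task

blob = cloudpickle.dumps(make_task(10))  # serialize the function itself
remote_task = cloudpickle.loads(blob)    # "on the worker side"
print(remote_task([5, 15, 25]))          # -> [15, 25]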



-- 
Ruslan Dautkhanov

On Wed, Apr 25, 2018 at 9:54 AM, Taylor Edmiston 
wrote:

> Is it possible for the (hypothetical) Airflow SparkExecutor to handle
> general execution of any operator (i.e., run non-Spark code)?
>
> *Taylor Edmiston*
> Blog <http://blog.tedmiston.com> | Stack Overflow CV
> <https://stackoverflow.com/story/taylor> | LinkedIn
> <https://www.linkedin.com/in/tedmiston/> | AngelList
> <https://angel.co/taylor>
>
>
> On Wed, Apr 25, 2018 at 11:22 AM, Ruslan Dautkhanov 
> wrote:
>
> > I used "Executor" as an Airflow term - I didn't mean a Spark executor.
> > Spark would be one of the Executors
> > in here
> > https://github.com/apache/incubator-airflow/tree/master/airflow/executors
> > or in here
> > https://github.com/apache/incubator-airflow/tree/master/airflow/contrib/executors
> >
> > Thanks.
> >
> >
> >
> > --
> > Ruslan Dautkhanov
> >
> > On Wed, Apr 25, 2018 at 9:17 AM, Bolke de Bruin 
> wrote:
> >
> > > I'm a bit lost on the spark executor, to be honest. To my knowledge the
> > > spark driver creates spark executors which run spark code. In other
> > > words, it can’t arbitrarily run generic code. Or can it?
> > >
> > > B.
> > >
> > > Verstuurd vanaf mijn iPad
> > >
> > > > Op 25 apr. 2018 om 17:11 heeft Ruslan Dautkhanov <
> dautkha...@gmail.com
> > >
> > > het volgende geschreven:
> > > >
> > > > Now I think an Airflow PySpark Executor would be an easier target.
> > > > Spark runs on YARN, Mesos and now Kubernetes,
> > > > so a PySpark Executor would effectively port Airflow to all of these
> > > > schedulers.
> > > > It's my understanding we currently have only a Spark Operator, not an
> > > > Executor.
> > > >
> > > > Thanks!
> > > >
> > > >
> > > >
> > > > --
> > > > Ruslan Dautkhanov
> > > >

Re: tabulate 0.82

2018-05-16 Thread Ruslan Dautkhanov
https://issues.apache.org/jira/browse/AIRFLOW-2476
https://github.com/apache/incubator-airflow/pull/3366/

Thanks!



-- 
Ruslan Dautkhanov

On Tue, Jan 16, 2018 at 11:51 PM, Bolke de Bruin  wrote:

> It is a protection against major upgrades that are not backwards
> compatible. Please test and provide a PR if it is ok.
>
> Cheers
> Bolke
>
> Verstuurd vanaf mijn iPad
>
> > Op 16 jan. 2018 om 22:36 heeft Ruslan Dautkhanov 
> het volgende geschreven:
> >
> > We have tabulate 0.8.2, but the requirements demand tabulate<0.8.0,>=0.7.5.
> >
> > Are there known issues with tabulate versions higher than 0.8.0?
> >
> >
> > $ airflow kerberos -D
> >
> > Traceback (most recent call last):
> >   File "/opt/cloudera/parcels/Anaconda/bin/airflow", line 4, in <module>
> >     __import__('pkg_resources').require('apache-airflow==1.10.0.dev0+incubating')
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 2985, in <module>
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 2971, in _call_aside
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 2998, in _initialize_master_working_set
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 662, in _build_master
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 675, in _build_from_requirements
> >   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/site-packages/setuptools-27.2.0-py2.7.egg/pkg_resources/__init__.py", line 854, in resolve
> > pkg_resources.DistributionNotFound: The 'tabulate<0.8.0,>=0.7.5'
> > distribution was not found and is required by apache-airflow
> >
> >
> >
> > --
> > Ruslan Dautkhanov
>
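
For reference, version caps like this live in Airflow's setup.py. A sketch
of the kind of change the PR above makes (the exact bounds here are
illustrative - check the linked PR for the real ones):

# setup.py (excerpt) - bounds shown are illustrative
install_requires = [
    # ... other pinned dependencies ...
    'tabulate>=0.7.5, <0.8.0',   # before: rejects tabulate 0.8.2 at startup
    # 'tabulate>=0.7.5, <0.9.0', # after: allows 0.8.x once it has been tested
]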