Re: Apache Airflow / Cloud Composer workshops Amsterdam

2018-10-12 Thread Ben Gregory
Hey Fokko!

Sounds like a great event! Will any of the talks/workshops be
streamed/livecast/recorded for those of us who can't make it to Amsterdam?

- Ben

On Fri, Oct 12, 2018 at 12:40 PM Driesprong, Fokko wrote:

> Hi all,
>
> From October 15-19, 2018, GoDataFest takes place in Amsterdam, The
> Netherlands. This week is dedicated to data technology and features free
> talks, training sessions and workshops.
>
> Leading tech companies, like AWS (Monday, October 15), Dataiku (Tuesday,
> October 16), Databricks (Wednesday, October 17), and Google Cloud
> (Thursday, October 18) each host an entire day to share their latest
> innovations. The final day, Friday, October 19, is dedicated to
> open-source, including Apache Airflow. During the open-source day, October
> 19, we organize a free Airflow workshop, taking place from 15:00 – 17:00.
>
> Feel free to mix-and-match activities to create your ultimate and personal
> data festival. Make sure to register directly, as seats are limited.
> http://www.godatafest.com/
>
> Cheers, Fokko
>


-- 

*Ben Gregory*
Data Engineer

Mobile: +1-615-483-3653 • Online: astronomer.io 

Download our new ebook: From Volume to Value - A Guide to Data Engineering.


Apache Airflow / Cloud Composer workshops Amsterdam

2018-10-12 Thread Driesprong, Fokko
Hi all,

From October 15-19, 2018, GoDataFest takes place in Amsterdam, The
Netherlands. This week is dedicated to data technology and features free
talks, training sessions and workshops.

Leading tech companies, like AWS (Monday, October 15), Dataiku (Tuesday,
October 16), Databricks (Wednesday, October 17), and Google Cloud
(Thursday, October 18) each host an entire day to share their latest
innovations. The final day, Friday, October 19, is dedicated to
open-source, including Apache Airflow. During the open-source day, October
19, we organize a free Airflow workshop, taking place from 15:00 – 17:00.

Feel free to mix-and-match activities to create your ultimate and personal
data festival. Make sure to register directly, as seats are limited.
http://www.godatafest.com/

Cheers, Fokko


Re: Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread James Meickle
For something to add to Airflow itself: I would love a more flexible
mapping between data time and processing time. The default is "n-1" (day
over day, you're aiming to process yesterday's data) but people post other
use cases on this mailing list quite frequently.
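
Until something like that exists in core, one way to at least keep the offset
in a single place is a user-defined macro on the DAG. A rough sketch (the DAG
id, the data_date name and the two-day offset below are all illustrative, not
an established recipe):

from datetime import datetime, timedelta

from airflow import DAG


def data_date(execution_date, delay_days=2):
    # The date the delivered files actually refer to, given an n-2 delivery lag.
    return (execution_date - timedelta(days=delay_days)).strftime("%Y/%m/%d")


dag = DAG(
    "ingest_table1",
    start_date=datetime(2018, 10, 1),
    schedule_interval="@daily",
    user_defined_macros={"data_date": data_date},
)

# Templated fields can then use {{ data_date(execution_date) }} everywhere,
# so the data-time/processing-time mapping lives in exactly one place.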

On Fri, Oct 12, 2018 at 7:46 AM Faouz El Fassi  wrote:

> What about an exponential back off on the poke interval?
>
> On Fri, 12 Oct 2018, 13:01 Ash Berlin-Taylor,  wrote:
>
> > That would work for some of our other use cases (and has been an idea in
> > our backlog for months) but not this case as we're reading from someone
> > else's bucket so can't set up notifications etc. :(
> >
> > -ash
> >
> > > On 12 Oct 2018, at 11:57, Bolke de Bruin  wrote:
> > >
> > > S3 Bucket notification that triggers a dag?
> > >
> > > Sent from my iPad
> > >
> > >> On 12 Oct 2018 at 12:42, Ash Berlin-Taylor wrote the following:
> > >>
> > >> A lot of our dags are ingesting data (usually daily or weekly) from
> > suppliers, and they are universally late.
> > >>
> > >> In the case I'm setting up now the delivery lag is about 30 hours - data
> > >> for 2018-10-10 turned up at 2018-10-12 05:43.
> > >>
> > >> I was going to just set this up with an S3KeySensor and a daily
> > schedule, but I'm wondering if anyone has any other bright ideas for a
> > better way of handling this sort of case:
> > >>
> > >>   dag = DAG(
> > >>   DAG_ID,
> > >>   default_args=args,
> > >>   start_date=args['start_date'],
> > >>   concurrency=1,
> > >>   schedule_interval='@daily',
> > >>   params={'country': cc}
> > >>   )
> > >>
> > >>   with dag:
> > >>   task = S3KeySensor(
> > >>   task_id="await_files",
> > >>   bucket_key="s3://bucket/raw/table1-{{ params.country }}/{{
> > execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
> > >>   poke_interval=60 * 60 * 2,
> > >>   timeout=60 * 60 * 72,
> > >>   )
> > >>
> > >> That S3 key sensor is _going_ to fail the first 18 times or so it runs,
> > >> which just seems silly.
> > >>
> > >> One option could be to use `ds_add` or similar on the execution date,
> > >> but I don't like breaking the (obvious) link between execution date and
> > >> which files it picks up, so I've ruled out this option.
> > >>
> > >> I could use a Time(Delta)Sensor to just delay the start of the
> > >> checking. I guess with the new change in master to make sensors yield
> > >> their execution slots, that's not a terrible plan.
> > >>
> > >> Does anyone else have any other ideas, including possible things we
> > >> could add to Airflow itself?
> > >>
> > >> -ash
> > >>
> >
> >
>


Re: Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread Faouz El Fassi
What about an exponential back off on the poke interval?
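
Something along those lines can be prototyped without touching core by letting
the sensor grow its own interval. A rough, untested sketch (Airflow 1.10-style
import paths, default 'poke' mode; BackoffS3KeySensor and max_poke_interval
are made-up names):

from airflow.sensors.s3_key_sensor import S3KeySensor
from airflow.utils.decorators import apply_defaults


class BackoffS3KeySensor(S3KeySensor):
    # Doubles the poke interval after every unsuccessful poke, capped at
    # max_poke_interval seconds, so early misses don't poll as often.

    @apply_defaults
    def __init__(self, max_poke_interval=60 * 60 * 4, *args, **kwargs):
        super(BackoffS3KeySensor, self).__init__(*args, **kwargs)
        self.max_poke_interval = max_poke_interval

    def poke(self, context):
        found = super(BackoffS3KeySensor, self).poke(context)
        if not found:
            # BaseSensorOperator sleeps for self.poke_interval between pokes,
            # so bumping it here gives the back-off behaviour.
            self.poke_interval = min(self.poke_interval * 2,
                                     self.max_poke_interval)
        return found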

On Fri, 12 Oct 2018, 13:01 Ash Berlin-Taylor,  wrote:

> That would work for some of our other use cases (and has been an idea in
> our backlog for months) but not this case as we're reading from someone
> else's bucket so can't set up notifications etc. :(
>
> -ash
>
> > On 12 Oct 2018, at 11:57, Bolke de Bruin  wrote:
> >
> > S3 Bucket notification that triggers a dag?
> >
> > Sent from my iPad
> >
> >> On 12 Oct 2018 at 12:42, Ash Berlin-Taylor wrote the following:
> >>
> >> A lot of our dags are ingesting data (usually daily or weekly) from
> suppliers, and they are universally late.
> >>
> >> In the case I'm setting up now the delivery lag is about 30 hours - data
> >> for 2018-10-10 turned up at 2018-10-12 05:43.
> >>
> >> I was going to just set this up with an S3KeySensor and a daily
> schedule, but I'm wondering if anyone has any other bright ideas for a
> better way of handling this sort of case:
> >>
> >>   dag = DAG(
> >>   DAG_ID,
> >>   default_args=args,
> >>   start_date=args['start_date'],
> >>   concurrency=1,
> >>   schedule_interval='@daily',
> >>   params={'country': cc}
> >>   )
> >>
> >>   with dag:
> >>   task = S3KeySensor(
> >>   task_id="await_files",
> >>   bucket_key="s3://bucket/raw/table1-{{ params.country }}/{{
> execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
> >>   poke_interval=60 * 60 * 2,
> >>   timeout=60 * 60 * 72,
> >>   )
> >>
> >> That S3 key sensor is _going_ to fail the first 18 times or so it runs,
> >> which just seems silly.
> >>
> >> One option could be to use `ds_add` or similar on the execution date,
> >> but I don't like breaking the (obvious) link between execution date and
> >> which files it picks up, so I've ruled out this option.
> >>
> >> I could use a Time(Delta)Sensor to just delay the start of the
> >> checking. I guess with the new change in master to make sensors yield their
> >> execution slots, that's not a terrible plan.
> >>
> >> Does anyone else have any other ideas, including possible things we
> >> could add to Airflow itself?
> >>
> >> -ash
> >>
>
>


Re: Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread Ash Berlin-Taylor
That would work for some of our other use cases (and has been an idea in our
backlog for months) but not this case, as we're reading from someone else's
bucket so we can't set up notifications etc. :(

-ash

> On 12 Oct 2018, at 11:57, Bolke de Bruin  wrote:
> 
> S3 Bucket notification that triggers a dag?
> 
> Sent from my iPad
> 
>> On 12 Oct 2018 at 12:42, Ash Berlin-Taylor wrote the following:
>> 
>> A lot of our dags are ingesting data (usually daily or weekly) from 
>> suppliers, and they are universally late.
>> 
>> In the case I'm setting up now the delivery lag is about 30 hours - data for 
>> 2018-10-10 turned up at 2018-10-12 05:43.
>> 
>> I was going to just set this up with an S3KeySensor and a daily schedule, 
>> but I'm wondering if anyone has any other bright ideas for a better way of 
>> handling this sort of case:
>> 
>>   dag = DAG(
>>   DAG_ID,
>>   default_args=args,
>>   start_date=args['start_date'],
>>   concurrency=1,
>>   schedule_interval='@daily',
>>   params={'country': cc}
>>   )
>> 
>>   with dag:
>>   task = S3KeySensor(
>>   task_id="await_files",
>>   bucket_key="s3://bucket/raw/table1-{{ params.country }}/{{ 
>> execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
>>   poke_interval=60 * 60 * 2,
>>   timeout=60 * 60 * 72,
>>   )
>> 
>> That S3 key sensor is _going_ to fail the first 18 times or so it runs, which 
>> just seems silly.
>> 
>> One option could be to use `ds_add` or similar on the execution date, but I 
>> don't like breaking the (obvious) link between execution date and which 
>> files it picks up, so I've ruled out this option.
>> 
>> I could use a Time(Delta)Sensor to just delay the start of the checking. I 
>> guess with the new change in master to make sensors yield their execution 
>> slots, that's not a terrible plan.
>> 
>> Does anyone else have any other ideas, including possible things we could add 
>> to Airflow itself?
>> 
>> -ash
>> 



Ingest daily data, but delivery is always delayed by two days

2018-10-12 Thread Ash Berlin-Taylor
A lot of our dags are ingesting data (usually daily or weekly) from suppliers, 
and they are universally late.

In the case I'm setting up now the delivery lag is about 30 hours - data for 
2018-10-10 turned up at 2018-10-12 05:43.

I was going to just set this up with an S3KeySensor and a daily schedule, but 
I'm wondering if anyone has any other bright ideas for a better way of handling 
this sort of case:

dag = DAG(
    DAG_ID,
    default_args=args,
    start_date=args['start_date'],
    concurrency=1,
    schedule_interval='@daily',
    params={'country': cc},
)

with dag:
    task = S3KeySensor(
        task_id="await_files",
        bucket_key="s3://bucket/raw/table1-{{ params.country }}/"
                   "{{ execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
        poke_interval=60 * 60 * 2,  # poke every two hours
        timeout=60 * 60 * 72,       # give up after 72 hours
    )

That S3 key sensor is _going_ to fail the first 18 times or so it runs, which 
just seems silly.

One option could be to use `ds_add` or similar on the execution date, but I 
don't like breaking the (obvious) link between execution date and which files 
it picks up, so I've ruled out this option.
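
For reference, that ruled-out variant would only be a template change, roughly
the following (the two-day offset is illustrative and depends on when the run
actually fires):

# Equivalent key template using macros.ds_add/ds_format instead of
# execution_date: the run for 2018-10-12 would pick up the 2018-10-10 files.
bucket_key = (
    "s3://bucket/raw/table1-{{ params.country }}/"
    "{{ macros.ds_format(macros.ds_add(ds, -2), '%Y-%m-%d', '%Y/%m/%d') }}"
    "/SUCCESS"
)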

I could use a Time(Delta)Sensor to just delay the start of the checking. I 
guess with the new change in master to make sensors yield their execution 
slots, that's not a terrible plan.
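
A rough sketch of that, reusing the dag object from the snippet above
(1.10-style import paths; the 29-hour delta and the shorter poke/timeout
values are just illustrative):

from datetime import timedelta

from airflow.sensors.s3_key_sensor import S3KeySensor
from airflow.sensors.time_delta_sensor import TimeDeltaSensor

with dag:
    # TimeDeltaSensor waits until execution_date + schedule_interval + delta,
    # so for the 2018-10-10 run this completes around 2018-10-12 05:00.
    wait_for_delivery_window = TimeDeltaSensor(
        task_id="wait_for_delivery_window",
        delta=timedelta(hours=29),
    )

    await_files = S3KeySensor(
        task_id="await_files",
        bucket_key="s3://bucket/raw/table1-{{ params.country }}/"
                   "{{ execution_date.strftime('%Y/%m/%d') }}/SUCCESS",
        poke_interval=60 * 60,
        timeout=60 * 60 * 24,
    )

    wait_for_delivery_window >> await_files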

Does anyone else have any other ideas, including possible things we could add 
to Airflow itself?

-ash



Re: "setup.py test" is being naughty

2018-10-12 Thread Driesprong, Fokko
We're working hard to get rid of the tight Travis integration and to move to
a Docker-based setup. I think it should be very easy to get a Docker image up
and running that is packed with the required dependencies. Unfortunately,
we're not there yet. Also, the tox layer feels a bit redundant to me, since
we're using Docker now.

Cheers, Fokko

On Wed, 3 Oct 2018 at 15:08, Jarek Potiuk wrote:

> Local testing works well for a number of unit tests when run from the IDE.
> We of course run the full suite of tests via the Docker environment, but our
> own test classes/modules are run using a local Python environment. For one,
> it's the easiest way to configure a local Python virtualenv with
> IntelliJ/PyCharm. You can - in recent versions of PyCharm/IntelliJ - have a
> Docker Python environment set up, but there are certain downsides to using
> it (speed, mounting local volumes with sources, etc.).
>
> So I think we should not really discourage running at least some tests
> locally. Maybe (if there are not many of those) we could identify the tests
> which require the full-blown Docker environment, mark them with skipUnless,
> and only have them executed when we are inside the dockerized environment
> for unit tests?
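>
> A minimal sketch of that, assuming the dockerized test environment exports
> some marker variable (the AIRFLOW_RUNNING_IN_DOCKER name below is made up,
> not an existing convention):
>
>     import os
>     import unittest
>
>     # Hypothetical marker set by the dockerized test environment.
>     requires_docker_env = unittest.skipUnless(
>         os.environ.get('AIRFLOW_RUNNING_IN_DOCKER') == 'true',
>         'requires the full dockerized test environment',
>     )
>
>     @requires_docker_env
>     class TestNeedsFullEnvironment(unittest.TestCase):
>         def test_heavy_integration(self):
>             self.assertTrue(True)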
>
> J.
>
>
> On Wed, Oct 3, 2018 at 1:48 PM Holden Karau  wrote:
>
> > I think (in the short term) discontinuing local testing and telling folks
> > to use the Docker-based approach makes more sense (many of the tests have a
> > complex set of dependencies that don't make sense to try and test locally).
> > What do other folks think?
> >
> > On Wed, Oct 3, 2018 at 4:45 AM EKC (Erik Cederstrand) wrote:
> >
> > > The test suite is also trying to create /usr/local/bin/airflow, which
> > > means I can't run the test suite on a machine that actually uses
> > > /usr/local/bin/airflow. And the default config file doesn't find the
> > > MySQL server I set up locally. I'm trying the Docker-based test
> > > environment now.
> > >
> > >
> > > It seems the local test setup either needs polishing or should be
> > > discontinued.
> > >
> > >
> > > Erik
> > >
> > > 
> > > From: EKC (Erik Cederstrand)
> > > Sent: Wednesday, October 3, 2018 12:01:00 PM
> > > To: dev@airflow.incubator.apache.org
> > > Subject: "setup.py test" is being naughty
> > >
> > >
> > > Hi all,
> > >
> > >
> > > I wanted to contribute a simple patch, and as a good open source citizen
> > > I wanted to also contribute a test. So I git clone from GitHub, create a
> > > virtualenv and run "setup.py test". First experience is that my
> > > /etc/krb5.conf is overwritten, which means my account is locked out of
> > > all systems here at work. I recovered from that, only to find out that
> > > ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub were also overwritten. Now I'm not
> > > very amused.
> > >
> > >
> > > Did I miss something in CONTRIBUTING.md?
> > >
> > >
> > > Erik
> > >
> >
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> > https://amzn.to/2MaRAG9  
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
>
>
> --
>
> *Jarek Potiuk, Principal Software Engineer*
> Mobile: +48 660 796 129
>