Thanks for sharing this, Ruslan - I will take a look. I agree that paragraphs can form tasks within a DAG. My point was that ideally a DAG could encompass multiple notes, i.e. the completion of one note triggers another, and so on, to complete an entire chain of dependent tasks.
For example, team A has a note that generates data set A*. Teams B & C each
have notes that depend on A* to generate B* & C* for their specific purposes.
It doesn't make sense for all of that to have to live in one note, but they
are all part of a single workflow.

Best,
--Ben

On Fri, May 19, 2017 at 9:02 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> Thanks for sharing this, Ben.
>
> I agree Zeppelin is a better fit, with tighter integration with Spark and
> built-in visualizations.
>
> We have pretty much standardized on pySpark, so here's one of the scripts
> we use internally to extract %pyspark, %sql and %md paragraphs into a
> standalone script (that can be scheduled in Airflow, for example):
> https://github.com/Tagar/stuff/blob/master/znote.py (patches are welcome :-)
>
> Hope this helps.
>
> ps. In my opinion, adding dependencies between paragraphs wouldn't be that
> hard for simple cases, and could be a first step toward defining a DAG in
> Zeppelin directly. It would be really awesome if we see this type of
> integration in the future.
>
> Otherwise I don't see much value if a whole note / whole workflow runs as
> a single task in Airflow. In my opinion, each paragraph has to be a task -
> then it'll be very useful.
>
> Thanks,
> Ruslan
>
> On Fri, May 19, 2017 at 4:55 PM, Ben Vogan <b...@shopkick.com> wrote:
>
>> I do not expect the relationship between DAGs to be described in Zeppelin
>> - that would be done in Airflow. It just seems that Zeppelin is such a
>> great tool for a data scientist's workflow that it would be nice if, once
>> they are done with the work, the note could be productionized directly. I
>> could envision a couple of scenarios:
>>
>> 1. Using a Zeppelin instance to run the note via the REST API. The
>> instance could be containerized and spun up specifically for a DAG, or it
>> could be a permanently available one.
>> 2. A note could be pulled from git, and some part of the Zeppelin engine
>> could execute the note without the web UI at all.
>>
>> I would expect there to be some special operators on the Airflow side for
>> executing these.
>>
>> If the scheduler is pluggable, then it should be possible to create a
>> plug-in that talks to the Airflow REST API.
>>
>> I happen to prefer Zeppelin to Jupyter - although I get your point about
>> both being Python. I don't really view that as a problem - most of the big
>> data platforms I'm talking to are implemented on the JVM, after all. The
>> Python part of Airflow is really just describing what gets run, and it
>> isn't hard to run something that isn't written in Python.
>>
>> On Fri, May 19, 2017 at 2:52 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
>>
>>> We also use both Zeppelin and Airflow.
>>>
>>> I'm interested in hearing what others are doing here too.
>>>
>>> Although honestly there might be some challenges:
>>> - Airflow expects a DAG structure, while a notebook has a pretty linear
>>> structure;
>>> - Airflow is Python-based, while Zeppelin is all Java (the REST API might
>>> be of help?). Jupyter+Airflow might be a more natural fit to integrate?
>>>
>>> On top of that, the way we use Zeppelin is a lot of ad-hoc queries, while
>>> Airflow is for more finalized workflows, I guess?
>>>
>>> Thanks for bringing this up.
>>>
>>> --
>>> Ruslan Dautkhanov
>>>
>>> On Fri, May 19, 2017 at 2:20 PM, Ben Vogan <b...@shopkick.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are really enjoying the workflow of interacting with our data via
>>>> Zeppelin, but are not sold on using the built-in cron scheduling
>>>> capability. We would like to be able to create more complex DAGs that
>>>> are better suited to something like Airflow. I was curious as to whether
>>>> anyone has done an integration of Zeppelin with Airflow - either
>>>> directly from within Zeppelin, or from the Airflow side.
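For concreteness, the first scenario above (an Airflow task that runs a note through a Zeppelin instance's REST API) could look roughly like the following. This is a minimal sketch, not an existing operator: it assumes Zeppelin's documented endpoint for running all paragraphs of a note (POST /api/notebook/job/{noteId}), and the host, port, and note id are placeholders - verify the endpoint against the Zeppelin version in use.

```python
# Hypothetical sketch: trigger a Zeppelin note over its REST API, e.g. from
# an Airflow PythonOperator callable. Endpoint assumed from Zeppelin's REST
# API docs (POST /api/notebook/job/{noteId} runs all paragraphs of a note).
import json
import urllib.request


def run_note_url(base_url: str, note_id: str) -> str:
    """Build the URL that asks Zeppelin to run every paragraph of a note."""
    return f"{base_url.rstrip('/')}/api/notebook/job/{note_id}"


def trigger_note(base_url: str, note_id: str) -> dict:
    """POST the run request and return Zeppelin's decoded JSON reply."""
    req = urllib.request.Request(run_note_url(base_url, note_id), method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage from an Airflow task callable (not executed here; host and note id
# are made-up placeholders):
#   trigger_note("http://zeppelin-host:8080", "<note id>")
```

Polling GET /api/notebook/job/{noteId} for paragraph statuses until completion would turn this into a blocking task suitable for a DAG, but error handling and auth are deployment-specific and omitted here.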
>>>> Thanks,
>>>> --
>>>> *BENJAMIN VOGAN* | Data Platform Team Lead
>>>>
>>>> <http://www.shopkick.com/> <https://www.facebook.com/shopkick>
>>>> <https://www.instagram.com/shopkick/> <https://www.pinterest.com/shopkick/>
>>>> <https://twitter.com/shopkickbiz>
>>>> <https://www.linkedin.com/company-beta/831240/?pathWildcard=831240>
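The cross-note workflow discussed earlier in the thread (note A producing data set A*, with notes B and C each depending on it) boils down to running notes in topological order of their dependencies. A minimal plain-Python sketch of that ordering logic - not Airflow code, and with all note names made up:

```python
# Sketch of a cross-note DAG: each note runs only after the notes it
# depends on have completed. Note names are illustrative placeholders.
# Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# note -> set of notes it depends on
note_deps = {
    "note_A": set(),       # team A's note: generates data set A*
    "note_B": {"note_A"},  # team B's note: consumes A*, generates B*
    "note_C": {"note_A"},  # team C's note: consumes A*, generates C*
}


def execution_order(deps):
    """Return one valid run order that honors all dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(note_deps)
# note_A is guaranteed to come before both of its dependents
assert order.index("note_A") < order.index("note_B")
assert order.index("note_A") < order.index("note_C")
```

In an actual Airflow deployment each entry would become a task (e.g. one operator invocation per note), with the same dependency edges declared between tasks, and Airflow's scheduler would take care of the ordering.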