Interesting idea. But by ‘p1’, ‘p2’, etc., did you literally mean those 
labels, or were you using them as shorthand for the ids of the paragraphs?
If the former, what happens if someone inserts, deletes or reorders 
paragraphs? If the latter, the paragraph ids wouldn’t be very easy for 
someone to read when following the dependency relationships…

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: 29 September 2017 11:58
To: users@zeppelin.apache.org
Subject: EXT: Re: Implementing run all paragraphs sequentially


I don't think two note settings (parallel/sequential) are sufficient for 
paragraph scheduling. Take the Spark tutorial note as an example: we should 
run the paragraph that loads the bank data first, and then we could run all 
the SQL paragraphs in parallel. So the key is how we define the dependency 
relationships between paragraphs. The paragraphs of a note could form a DAG 
(directed acyclic graph). Sequential running is just one special kind of DAG 
(a linked list).

I believe we discussed this before in the community. My proposal is to add an 
attribute to the interpreter indicator of each paragraph, so that the user can 
specify the paragraph's dependencies (if the user doesn't specify any, the 
default dependency is the paragraph ahead of it). Take the Spark tutorial note 
as an example again. We have 3 paragraphs: the first one loads the bank data, 
and the second and third query the data. So paragraphs 2 and 3 can run in 
parallel but must run after paragraph 1, and we need to specify that 
dependency in the interpreter indicator. Of course, users don't need to 
specify dependencies if they want to run all the paragraphs sequentially, 
because the default dependency is the paragraph ahead of it.

Paragraph 1.

%spark
// code to load bank data

Paragraph 2.

%spark.sql(deps=p1)
// query the bank data

Paragraph 3.

%spark.sql(deps=p1)
// query the bank data
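
To make the proposal concrete, here is a minimal sketch in Python (not Zeppelin 
code) of how such a scheduler might work. It assumes the hypothetical 
`(deps=...)` syntax from the example above, treats 'p1', 'p2', 'p3' as 
illustrative labels rather than real paragraph ids, and uses the helper names 
`parse_deps` and `run_stages`, which are made up for this illustration. A 
paragraph without a deps attribute depends on the paragraph ahead of it.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical header syntax from the proposal: "%spark.sql(deps=p1,p2)".
HEADER = re.compile(r"^%(?P<interp>[\w.]+)(?:\(deps=(?P<deps>[\w,\s]*)\))?")


def parse_deps(paragraphs):
    """Map each paragraph label to the set of labels it depends on.

    `paragraphs` is an ordered list of (label, text) pairs. A paragraph
    without an explicit deps attribute depends on the one before it.
    """
    graph = {}
    previous = None
    for label, text in paragraphs:
        match = HEADER.match(text.strip())
        deps_attr = match.group("deps") if match else None
        if deps_attr is not None:
            # Explicit dependencies, comma separated.
            deps = {d.strip() for d in deps_attr.split(",") if d.strip()}
        elif previous is not None:
            # Default: depend on the paragraph ahead of this one.
            deps = {previous}
        else:
            deps = set()
        graph[label] = deps
        previous = label
    return graph


def run_stages(graph):
    """Return a list of stages; paragraphs in one stage can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the deps form a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


note = [
    ("p1", "%spark\n// code to load bank data"),
    ("p2", "%spark.sql(deps=p1)\n// query the bank data"),
    ("p3", "%spark.sql(deps=p1)\n// query the bank data"),
]
print(run_stages(parse_deps(note)))  # [['p1'], ['p2', 'p3']]
```

With no deps attributes at all, every paragraph depends on its predecessor and 
the stages degenerate to one paragraph each, which is the sequential (linked 
list) case.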




On Fri, Sep 29, 2017 at 5:35 PM, afancy <grou...@gmail.com> wrote:
+1

I think this is one of the most important features. I don't know why this 
requirement has been skipped.

/afancy

On Thu, Sep 28, 2017 at 5:28 PM, Belousov Maksim Eduardovich 
<m.belou...@tinkoff.ru> wrote:
Hello, users!
At the moment our analysts often mix interpreters in their notes.
For example, they prepare data using %jdbc and then use it in %pyspark. 
Besides, they often use scheduling for regular reporting, so they have to 
do something like `time.sleep()` to wait for the data from %jdbc. That 
doesn't guarantee the result and doesn't look cool.

You can find early attempts to implement sequential running of all paragraphs 
in [1].
We are really interested in implementing issue [2] and are ready to work on 
it.
It seems a good idea to discuss the requirements first.
My idea is to introduce a note setting that defines the type of run to use 
(parallel or sequential) and leave "Run all" as the only button that runs all 
the cells in the note. This makes sequential or parallel running a `note 
option` rather than a `run option`.
The option will be controlled by a nearby button, as shown:
[https://lh6.googleusercontent.com/jwnb7xfb0fPbFg1CWPoMSqovu7ecSMv4pJfuP4zdKVZbyAUDwzAT2GJ5EiemXVYrqMW73yklemTpjXNyLRJABpTCoHi6us2ZI_AxWKHwZpBEA7MjpMP0-7Nk8saaJQfIF4yBMPfS]


For new notes the default state would be "Run sequential all"; for old notes, 
"Run parallel for interpreters".
We are glad to hear any thoughts.
Thank you.

[1] https://issues.apache.org/jira/browse/ZEPPELIN-1165
[2] https://issues.apache.org/jira/browse/ZEPPELIN-2368



Maksim Belousov

