Pranav, proposal looks awesome!
I have a question and some feedback.
You said you tested 1, 2, and 3. To create a SparkIMain per
notebook, you need the notebook id. Did you get it from the
InterpreterContext?
Then, how did you handle destroying the SparkIMain (when a
notebook is deleted)?
As far as I know, the interpreter is not able to get notified
of notebook deletion.
>> 4. Build a queue inside interpreter to allow only one
paragraph execution
>> at a time per notebook.
One downside of this approach is that the GUI will display
RUNNING instead of PENDING for jobs queued inside the
interpreter.
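The per-notebook queue of item 4 can be made concrete. Below is a minimal, hypothetical sketch (Java, since Zeppelin's interpreters are Java; none of these names are Zeppelin's real scheduler API): one single-threaded executor per notebook id serializes paragraphs within a note while different notes stay parallel. It also shows the downside above: a paragraph waiting in this queue has already been submitted, so the GUI sees it as RUNNING, not PENDING.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: one single-threaded executor per notebook id,
// so paragraphs of one note run strictly one at a time while paragraphs of
// different notes proceed in parallel.
class NoteQueues {
    private static final Map<String, ExecutorService> QUEUES = new ConcurrentHashMap<>();

    static ExecutorService executorFor(String noteId) {
        // Lazily create the per-note queue on first use.
        return QUEUES.computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor());
    }

    // FIFO order is kept within a note; a paragraph sitting in this queue
    // already looks RUNNING to the GUI rather than PENDING.
    static void submit(String noteId, Runnable paragraph) {
        executorFor(noteId).submit(paragraph);
    }
}
```

A real implementation would also need to shut down and remove a note's executor when the note goes away, which circles back to the notebook-deletion question above.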
Best,
moon
On Sun, Aug 16, 2015 at 12:55 AM IT CTO
<goi....@gmail.com> wrote:
+1 for "to re-factor the Zeppelin architecture so
that it can handle multi-tenancy easily"
On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
<doanduy...@gmail.com> wrote:
Agree with Joel, we may think about re-factoring the
Zeppelin architecture so that it can handle
multi-tenancy easily. The technical solution
proposed by Pranav is great, but it only applies
to Spark. Right now, each interpreter has to
manage multi-tenancy its own way. Ultimately,
Zeppelin could propose a multi-tenancy
contract/info (like a UserContext, similar to the
InterpreterContext) that each interpreter can
choose to use or not.
On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
<djo...@gmail.com> wrote:
I think that while the idea of running multiple
notes simultaneously is great, it is really
dancing around the lack of true multi-user
support in Zeppelin. The proposed solution would
work if the application's resources are those of
the whole cluster; but if the app is limited (say
it has 8 cores out of 16, with some distribution
of memory), then potentially your note can hog
all the resources, and the scheduler will have to
throttle all other executions, leaving you
exactly where you are now.
While I think the solution is a good one, maybe
this question should make us think about adding
true multi-user support,
where we isolate resources (the cluster and the
notebooks themselves), have separate
login/identity, and (I don't know if it's
possible) share the same context.
Thanks,
Joel
> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
<mindpri...@gmail.com> wrote:
>
> If the problem is that multiple users have to wait for each
> other while using Zeppelin, the solution already exists: they
> can create a new interpreter by going to the interpreter page
> and attaching it to their notebook - then they don't have to
> wait for others to submit their job.
>
> But I agree, having paragraphs from one note wait for
> paragraphs from other notes is a confusing default. We can get
> around that in two ways:
>
> 1. Create a new interpreter for each note and attach that
> interpreter to that note. This approach would require the
> least amount of code changes, but it is resource heavy and
> doesn't let you share the SparkContext between different
> notes.
> 2. If we want to share the SparkContext between different
> notes, we can submit jobs from different notes into different
> fair scheduler pools (
> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
> This can be done by submitting jobs from different notes in
> different threads. This will make sure that jobs from one note
> run sequentially, but jobs from different notes will be able
> to run in parallel.
>
> Neither of these options requires any change in the Spark code.
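Rohit's option 2 can be sketched roughly as follows. This is a hypothetical illustration: with a real SparkContext (and spark.scheduler.mode=FAIR plus a pool configuration), the thread submitting a note's jobs would call sc.setLocalProperty("spark.scheduler.pool", ...). The stand-in context below only mimics the thread-local behavior that call relies on, so the pattern runs without a cluster; PoolTaggingContext, NotePools and the pool names are made up for this sketch.

```java
// Stand-in for the one SparkContext method this pattern needs: Spark's
// setLocalProperty is thread-local, so jobs submitted from a thread carry
// that thread's "spark.scheduler.pool" value.
class PoolTaggingContext {
    private final ThreadLocal<String> pool = new ThreadLocal<>();

    void setLocalProperty(String key, String value) {
        if ("spark.scheduler.pool".equals(key)) pool.set(value);
    }

    String getLocalProperty(String key) {
        return "spark.scheduler.pool".equals(key) ? pool.get() : null;
    }
}

class NotePools {
    static final PoolTaggingContext sc = new PoolTaggingContext();

    // One thread per note: everything run inside `job` inherits the thread's
    // pool tag, so notes run in parallel while each note's own jobs, all
    // submitted from the same thread, stay sequential.
    static Thread runInNotePool(String noteId, Runnable job) {
        Thread t = new Thread(() -> {
            sc.setLocalProperty("spark.scheduler.pool", "pool-" + noteId);
            job.run();
        });
        t.start();
        return t;
    }
}
```

With a real fair scheduler, each "pool-<noteId>" pool would get a fair share of the application's cores instead of one note's jobs starving the rest.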
>
> --
> Thanks & Regards
> Rohit Agarwal
> https://www.linkedin.com/in/rohitagarwal003
>
> On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
> <praag...@gmail.com> wrote:
>
>>> If someone can share about the idea of sharing a single
>>> SparkContext through multiple SparkILoops safely, it'll be
>>> really helpful.
>> Here is a proposal:
>> 1. In Spark code, change SparkIMain.scala to allow setting
>> the virtual directory. While creating new instances of
>> SparkIMain per notebook from the Zeppelin Spark interpreter,
>> point all the instances of SparkIMain to the same virtual
>> directory.
>> 2. Start an HTTP server on that virtual directory and set
>> this HTTP server in the SparkContext using the classServerUri
>> method.
>> 3. Scala-generated code has a notion of packages. The default
>> package name is "line$<linenumber>". The package name can be
>> controlled using the system property scala.repl.name.line.
>> Setting this property to the notebook id ensures that code
>> generated by individual instances of SparkIMain is isolated
>> from the other instances of SparkIMain.
>> 4. Build a queue inside the interpreter to allow only one
>> paragraph execution at a time per notebook.
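To see why step 3 matters, here is a hypothetical stand-in for the REPL's wrapper-naming scheme (not the actual compiler code; the notebook ids are made up). With the default scala.repl.name.line value of "line", every note's first statement compiles to the same class name, which is exactly the collision moon observed on the shared SparkContext's class server.

```java
// Illustrative stand-in only: the Scala REPL derives generated wrapper class
// names from the scala.repl.name.line system property (default "line"), so
// the first statement compiles to something like "$line1.$read". Two per-note
// interpreters left at the default would both emit that name; a per-note
// prefix keeps the generated names disjoint.
class ReplNaming {
    static String wrapperName(String lineName, int lineNumber) {
        return "$" + lineName + lineNumber + ".$read";
    }
}
```

So wrapperName("line", 1) collides across notes, while wrapperName("note_A_line", 1) and wrapperName("note_B_line", 1) cannot, which is what setting the property to the notebook id buys.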
>>
>> I have tested 1, 2, and 3, and it seems to provide isolation
>> across classnames. I'll work towards submitting a formal
>> patch soon - is there already a Jira for this that I can take
>> up? Also, I need to understand:
>> 1. How does Zeppelin uptake Spark fixes? Or do I need to
>> first work towards getting the Spark changes merged into
>> Apache Spark on GitHub?
>>
>> Any suggestions or comments on the proposal are highly
>> welcome.
>>
>> Regards,
>> -Pranav.
>>
>>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>
>>> Hi piyush,
>>>
>>> Separate instances of SparkILoop and SparkIMain for each
>>> notebook while sharing the SparkContext sounds great.
>>>
>>> Actually, I tried to do it, and found the problem that
>>> multiple SparkILoops could generate the same class name, and
>>> the Spark executors confuse classnames since they're reading
>>> classes from a single SparkContext.
>>>
>>> If someone can share about the idea of sharing a single
>>> SparkContext through multiple SparkILoops safely, it'll be
>>> really helpful.
>>>
>>> Thanks,
>>> moon
>>>
>>>
>>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data
>>> Platform) <piyush.muk...@flipkart.com> wrote:
>>>
>>> Hi Moon,
>>> Any suggestion on it? We have to wait a lot when multiple
>>> people are working with Spark.
>>> Can we create separate instances of SparkILoop, SparkIMain
>>> and print streams for each notebook, while sharing the
>>> SparkContext, ZeppelinContext, SQLContext and
>>> DependencyResolver, and then use the parallel scheduler?
>>> thanks
>>>
>>> -piyush
>>>
>>> Hi Moon,
>>>
>>> How about tracking a dedicated SparkContext per notebook in
>>> Spark's remote interpreter - this will allow multiple users
>>> to run their Spark paragraphs in parallel. Also, within a
>>> notebook only one paragraph is executed at a time.
>>>
>>> Regards,
>>> -Pranav.
>>>
>>>
>>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> Hi,
>>>>
>>>> Thanks for asking question.
>>>>
>>>> The reason is simply because it is running code
>>>> statements. The statements can have order and dependency.
>>>> Imagine I have two paragraphs:
>>>>
>>>> %spark
>>>> val a = 1
>>>>
>>>> %spark
>>>> print(a)
>>>>
>>>> If they're not run one by one, that means they possibly
>>>> run in random order, and the output will not always be the
>>>> same: either '1' or a "not found: value a" error.
>>>>
>>>> This is the reason why. But if there is a nice idea to
>>>> handle this problem, I agree that using the parallel
>>>> scheduler would help a lot.
>>>>
>>>> Thanks,
>>>> moon
>>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>> <linxizeng0...@gmail.com> wrote:
>>>>
>>>> Anyone who has the same question as me? Or is this not a
>>>> question?
>>>>
>>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>>>> <linxizeng0...@gmail.com>:
>>>>
>>>> hi, Moon:
>>>> I notice that the getScheduler function in
>>>> SparkInterpreter.java returns a FIFOScheduler, which makes
>>>> the Spark interpreter run Spark jobs one by one. It's not a
>>>> good experience when a couple of users do some work on
>>>> Zeppelin at the same time, because they have to wait for
>>>> each other. And at the same time, SparkSqlInterpreter can
>>>> choose which scheduler to use via
>>>> "zeppelin.spark.concurrentSQL".
>>>> My question is: what kind of considerations did you base
>>>> such a decision on?