Pranav, proposal looks awesome!
I have a question and some feedback.
You said you tested 1, 2, and 3. To create a SparkIMain per
notebook, you need the notebook id. Did you get it from the
InterpreterContext?
Then, how did you handle destroying the SparkIMain (when a
notebook is deleted)?
As far as I know, the interpreter is not able to get notified
of notebook deletion.
>> 4. Build a queue inside interpreter to allow only one
paragraph execution
>> at a time per notebook.
One downside of this approach is that the GUI will display
RUNNING instead of PENDING for jobs queued inside the
interpreter.
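The per-notebook queue of item 4 can be made concrete. Below is a minimal, hypothetical sketch (Java, since Zeppelin's interpreters are Java; none of these names are Zeppelin's real scheduler API): one single-threaded executor per notebook id serializes paragraphs within a note while different notes stay parallel. It also shows the downside above: a paragraph waiting in this queue has already been submitted, so the GUI sees it as RUNNING, not PENDING.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: one single-threaded executor per notebook id,
// so paragraphs of one note run strictly one at a time while paragraphs of
// different notes proceed in parallel.
class NoteQueues {
    private static final Map<String, ExecutorService> QUEUES = new ConcurrentHashMap<>();

    static ExecutorService executorFor(String noteId) {
        // Lazily create the per-note queue on first use.
        return QUEUES.computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor());
    }

    // FIFO order is kept within a note; a paragraph sitting in this queue
    // already looks RUNNING to the GUI rather than PENDING.
    static void submit(String noteId, Runnable paragraph) {
        executorFor(noteId).submit(paragraph);
    }
}
```

A real implementation would also need to shut down and remove a note's executor when the note goes away, which circles back to the notebook-deletion question above.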
Best,
moon
On Sun, Aug 16, 2015 at 12:55 AM IT CTO
<goi....@gmail.com> wrote:
+1 for "to re-factor the Zeppelin architecture so
that it can handle multi-tenancy easily"
On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
<doanduy...@gmail.com> wrote:
Agree with Joel, we may think about re-factoring the
Zeppelin architecture so that it can handle
multi-tenancy easily. The technical solution
proposed by Pranav is great, but it only applies
to Spark. Right now, each interpreter has to
manage multi-tenancy its own way. Ultimately,
Zeppelin could propose a multi-tenancy
contract/info (like a UserContext, similar to the
InterpreterContext) that each interpreter can
choose to use or not.
On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
<djo...@gmail.com> wrote:
I think that while the idea of running multiple
notes simultaneously is great, it is really
dancing around the lack of true multi-user
support in Zeppelin. The proposed solution would
work if the application's resources are those of
the whole cluster; but if the app is limited (say
it has 8 cores out of 16, with some distribution
of memory), then potentially your note can hog
all the resources, and the scheduler will have to
throttle all other executions, leaving you
exactly where you are now.
While I think the solution is a good one, maybe
this question should make us think about adding
true multi-user support,
where we isolate resources (the cluster and the
notebooks themselves), have separate
login/identity, and (I don't know if it's
possible) share the same context.
Thanks,
Joel
> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
<mindpri...@gmail.com> wrote:
>
> If the problem is that multiple users have to wait for each
> other while using Zeppelin, the solution already exists: they
> can create a new interpreter by going to the interpreter page
> and attaching it to their notebook - then they don't have to
> wait for others to submit their job.
>
> But I agree, having paragraphs from one note wait for
> paragraphs from other notes is a confusing default. We can get
> around that in two ways:
>
> 1. Create a new interpreter for each note and attach that
> interpreter to that note. This approach would require the
> least amount of code changes, but it is resource heavy and
> doesn't let you share the SparkContext between different
> notes.
> 2. If we want to share the SparkContext between different
> notes, we can submit jobs from different notes into different
> fair scheduler pools (
> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
> This can be done by submitting jobs from different notes in
> different threads. This will make sure that jobs from one note
> run sequentially, but jobs from different notes will be able
> to run in parallel.
>
> Neither of these options requires any change in the Spark code.
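Rohit's option 2 can be sketched roughly as follows. This is a hypothetical illustration: with a real SparkContext (and spark.scheduler.mode=FAIR plus a pool configuration), the thread submitting a note's jobs would call sc.setLocalProperty("spark.scheduler.pool", ...). The stand-in context below only mimics the thread-local behavior that call relies on, so the pattern runs without a cluster; PoolTaggingContext, NotePools and the pool names are made up for this sketch.

```java
// Stand-in for the one SparkContext method this pattern needs: Spark's
// setLocalProperty is thread-local, so jobs submitted from a thread carry
// that thread's "spark.scheduler.pool" value.
class PoolTaggingContext {
    private final ThreadLocal<String> pool = new ThreadLocal<>();

    void setLocalProperty(String key, String value) {
        if ("spark.scheduler.pool".equals(key)) pool.set(value);
    }

    String getLocalProperty(String key) {
        return "spark.scheduler.pool".equals(key) ? pool.get() : null;
    }
}

class NotePools {
    static final PoolTaggingContext sc = new PoolTaggingContext();

    // One thread per note: everything run inside `job` inherits the thread's
    // pool tag, so notes run in parallel while each note's own jobs, all
    // submitted from the same thread, stay sequential.
    static Thread runInNotePool(String noteId, Runnable job) {
        Thread t = new Thread(() -> {
            sc.setLocalProperty("spark.scheduler.pool", "pool-" + noteId);
            job.run();
        });
        t.start();
        return t;
    }
}
```

With a real fair scheduler, each "pool-<noteId>" pool would get a fair share of the application's cores instead of one note's jobs starving the rest.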
>
> --
> Thanks & Regards
> Rohit Agarwal
> https://www.linkedin.com/in/rohitagarwal003
>
> On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
> <praag...@gmail.com> wrote:
>
>>> If someone can share about the idea of sharing a single
>>> SparkContext through multiple SparkILoops safely, it'll be
>>> really helpful.
>> Here is a proposal:
>> 1. In Spark code, change SparkIMain.scala to allow setting
>> the virtual directory. While creating new instances of
>> SparkIMain per notebook from the Zeppelin Spark interpreter,
>> point all the instances of SparkIMain to the same virtual
>> directory.
>> 2. Start an HTTP server on that virtual directory and set
>> this HTTP server in the SparkContext using the classServerUri
>> method.
>> 3. Scala-generated code has a notion of packages. The default
>> package name is "line$<linenumber>". The package name can be
>> controlled using the system property scala.repl.name.line.
>> Setting this property to the notebook id ensures that code
>> generated by individual instances of SparkIMain is isolated
>> from the other instances of SparkIMain.
>> 4. Build a queue inside the interpreter to allow only one
>> paragraph execution at a time per notebook.
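To see why step 3 matters, here is a hypothetical stand-in for the REPL's wrapper-naming scheme (not the actual compiler code; the notebook ids are made up). With the default scala.repl.name.line value of "line", every note's first statement compiles to the same class name, which is exactly the collision moon observed on the shared SparkContext's class server.

```java
// Illustrative stand-in only: the Scala REPL derives generated wrapper class
// names from the scala.repl.name.line system property (default "line"), so
// the first statement compiles to something like "$line1.$read". Two per-note
// interpreters left at the default would both emit that name; a per-note
// prefix keeps the generated names disjoint.
class ReplNaming {
    static String wrapperName(String lineName, int lineNumber) {
        return "$" + lineName + lineNumber + ".$read";
    }
}
```

So wrapperName("line", 1) collides across notes, while wrapperName("note_A_line", 1) and wrapperName("note_B_line", 1) cannot, which is what setting the property to the notebook id buys.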
>>
>> I have tested 1, 2, and 3, and it seems to provide isolation
>> across classnames. I'll work towards submitting a formal
>> patch soon - is there already a Jira for this that I can take
>> up? Also, I need to understand:
>> 1. How does Zeppelin uptake Spark fixes? Or do I need to
>> first work towards getting the Spark changes merged into
>> Apache Spark on GitHub?
>>
>> Any suggestions or comments on the proposal are highly
>> welcome.
>>
>> Regards,
>> -Pranav.
>>
>>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>
>>> Hi piyush,
>>>
>>> Separate instances of SparkILoop and SparkIMain for each
>>> notebook while sharing the SparkContext sounds great.
>>>
>>> Actually, I tried to do it, and found the problem that
>>> multiple SparkILoops could generate the same class name, and
>>> the Spark executors confuse classnames since they're reading
>>> classes from a single SparkContext.
>>>
>>> If someone can share about the idea of sharing a single
>>> SparkContext through multiple SparkILoops safely, it'll be
>>> really helpful.
>>>
>>> Thanks,
>>> moon
>>>
>>>
>>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data
>>> Platform) <piyush.muk...@flipkart.com> wrote:
>>>
>>> Hi Moon,
>>> Any suggestion on it? We have to wait a lot when multiple
>>> people are working with Spark.
>>> Can we create separate instances of SparkILoop, SparkIMain
>>> and print streams for each notebook, while sharing the
>>> SparkContext, ZeppelinContext, SQLContext and
>>> DependencyResolver, and then use the parallel scheduler?
>>> thanks
>>>
>>> -piyush
>>>
>>> Hi Moon,
>>>
>>> How about tracking a dedicated SparkContext per notebook in
>>> Spark's remote interpreter - this will allow multiple users
>>> to run their Spark paragraphs in parallel. Also, within a
>>> notebook only one paragraph is executed at a time.
>>>
>>> Regards,
>>> -Pranav.
>>>
>>>
>>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> Hi,
>>>>
>>>> Thanks for asking question.
>>>>
>>>> The reason is simply because it is running code
>>>> statements. The statements can have order and dependency.
>>>> Imagine I have two paragraphs:
>>>>
>>>> %spark
>>>> val a = 1
>>>>
>>>> %spark
>>>> print(a)
>>>>
>>>> If they're not run one by one, that means they possibly
>>>> run in random order, and the output will not always be the
>>>> same: either '1' or a "not found: value a" error.
>>>>
>>>> This is the reason why. But if there is a nice idea to
>>>> handle this problem, I agree that using the parallel
>>>> scheduler would help a lot.
>>>>
>>>> Thanks,
>>>> moon
>>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>> <linxizeng0...@gmail.com> wrote:
>>>>
>>>> Anyone who has the same question as me? Or is this not a
>>>> question?
>>>>
>>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>>>> <linxizeng0...@gmail.com>:
>>>>
>>>> hi, Moon:
>>>> I notice that the getScheduler function in
>>>> SparkInterpreter.java returns a FIFOScheduler, which makes
>>>> the Spark interpreter run Spark jobs one by one. It's not a
>>>> good experience when a couple of users do some work on
>>>> Zeppelin at the same time, because they have to wait for
>>>> each other. And at the same time, SparkSqlInterpreter can
>>>> choose which scheduler to use via
>>>> "zeppelin.spark.concurrentSQL".
>>>> My question is: what kind of considerations did you base
>>>> such a decision on?