Any thoughts on how to package changes related to Spark?
On 17-Aug-2015 7:58 pm, "moon soo Lee" <m...@apache.org> wrote:

> I think releasing SparkIMain and related objects after a configurable
> period of inactivity would be good for now.
>
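> A minimal sketch of that idea on the interpreter side (the NoteSessionManager,
> the NoteSession holder and the 8-hour default below are illustrative, not
> existing Zeppelin API):
>
>     import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
>
>     object NoteSessionManager {
>       // Illustrative holder for a notebook's REPL objects (SparkIMain etc.).
>       class NoteSession(val imain: AnyRef) {
>         @volatile var lastUsedMs: Long = System.currentTimeMillis()
>       }
>
>       val sessions = new ConcurrentHashMap[String, NoteSession]()
>       val idleTimeoutMs = TimeUnit.HOURS.toMillis(8)  // configurable inactivity
>
>       // Periodically drop sessions that have been idle longer than the timeout.
>       private val cleaner = Executors.newSingleThreadScheduledExecutor()
>       cleaner.scheduleAtFixedRate(new Runnable {
>         def run(): Unit = {
>           val it = sessions.entrySet().iterator()
>           while (it.hasNext) {
>             val e = it.next()
>             if (System.currentTimeMillis() - e.getValue.lastUsedMs > idleTimeoutMs) {
>               // here: release the SparkIMain and other per-notebook objects
>               it.remove()
>             }
>           }
>         }
>       }, 1, 1, TimeUnit.MINUTES)
>     }
>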
> About the scheduler, I can help implement such a scheduler.
>
> Thanks,
> moon
>
> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <praag...@gmail.com>
> wrote:
>
>> Hi Moon,
>>
>> Yes, the notebook id comes from InterpreterContext. At the moment,
>> destroying SparkIMain on notebook deletion is not handled. I think
>> SparkIMain is a lightweight object; do you see a concern with keeping these
>> objects in a map? One possible option could be to destroy notebook-related
>> objects when a notebook has been inactive for more than, say, 8 hours.
>>
>>
>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>> >> execution at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs queued inside the interpreter.
>>
>> Yes, that's a good point. Having a scheduler at the Zeppelin server that is
>> parallel across notebooks and FIFO across paragraphs within a notebook
>> would be nice. Is there any plan for such a scheduler?
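>>
>> Just to make the idea concrete, a minimal sketch of such a scheduler
>> (NoteScheduler is an illustrative name, not existing Zeppelin API): one
>> single-thread executor per notebook gives FIFO within a note and
>> parallelism across notes.
>>
>>     import java.util.concurrent.{ExecutorService, Executors}
>>     import scala.collection.mutable
>>
>>     object NoteScheduler {
>>       private val executors = mutable.Map[String, ExecutorService]()
>>
>>       // One single-thread executor per note id: paragraphs of the same note
>>       // run FIFO, while different notes run in parallel.
>>       def submit(noteId: String, paragraph: Runnable): Unit = {
>>         val exec = synchronized {
>>           executors.getOrElseUpdate(noteId, Executors.newSingleThreadExecutor())
>>         }
>>         exec.submit(paragraph)
>>       }
>>     }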
>>
>> Regards,
>> -Pranav.
>>
>>
>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>
>> Pranav, proposal looks awesome!
>>
>> I have a question and some feedback:
>>
>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>> need the notebook id. Did you get it from InterpreterContext?
>> Then how did you handle destroying the SparkIMain (when a notebook is
>> deleted)?
>> As far as I know, the interpreter is not able to get notified of notebook
>> deletion.
>>
>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>> >> execution at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs queued inside the interpreter.
>>
>> Best,
>> moon
>>
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel, we may think about re-factoring the Zeppelin architecture
>>>> so that it can handle multi-tenancy easily. The technical solution proposed
>>>> by Pranav is great, but it only applies to Spark. Right now, each
>>>> interpreter has to manage multi-tenancy its own way. Ultimately Zeppelin
>>>> could propose a multi-tenancy contract (like a UserContext, similar to
>>>> InterpreterContext) that each interpreter can choose to use or not.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that while the idea of running multiple notes simultaneously is
>>>>> great, it is really dancing around the lack of true multi-user support in
>>>>> Zeppelin. The proposed solution would work if the application's resources
>>>>> were those of the whole cluster, but if the app is limited (say it has 8
>>>>> cores of 16, with a similar split in memory), then your note can
>>>>> potentially hog all the resources and the scheduler will have to throttle
>>>>> all other executions, leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question should make
>>>>> us think about adding true multi-user support,
>>>>> where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other while
>>>>> > using Zeppelin, the solution already exists: they can create a new
>>>>> > interpreter by going to the interpreter page and attach it to their
>>>>> > notebook - then they don't have to wait for others to submit their job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>>> > other notes is a confusing default. We can get around that in two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that interpreter
>>>>> >   to that note. This approach would require the least amount of code
>>>>> >   changes, but it is resource heavy and doesn't let you share the
>>>>> >   SparkContext between different notes.
>>>>> >   2. If we want to share the SparkContext between different notes, we
>>>>> >   can submit jobs from different notes into different fair scheduler
>>>>> >   pools (
>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> >   ). This can be done by submitting jobs from different notes in
>>>>> >   different threads; see the sketch below. This will make sure that jobs
>>>>> >   from one note run sequentially, but jobs from different notes will be
>>>>> >   able to run in parallel.
>>>>> >
>>>>> > Neither of these options requires any change in the Spark code.
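>>>>> >
>>>>> > A minimal sketch of option 2, using only documented Spark APIs (the
>>>>> > per-note pool naming and the runParagraphForNote helper are
>>>>> > illustrative, and the context needs spark.scheduler.mode=FAIR):
>>>>> >
>>>>> >     import org.apache.spark.SparkContext
>>>>> >
>>>>> >     // Run one paragraph's job on its own thread, tagged with a fair
>>>>> >     // scheduler pool named after the note. Pools isolate notes from
>>>>> >     // each other; sequencing within a note would come from a per-note
>>>>> >     // queue placed in front of this call.
>>>>> >     def runParagraphForNote(sc: SparkContext, noteId: String)(job: => Unit): Thread = {
>>>>> >       val t = new Thread(new Runnable {
>>>>> >         def run(): Unit = {
>>>>> >           sc.setLocalProperty("spark.scheduler.pool", noteId)
>>>>> >           job
>>>>> >         }
>>>>> >       })
>>>>> >       t.start()
>>>>> >       t
>>>>> >     }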
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>>>>> > <praag...@gmail.com> wrote:
>>>>> >
>>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>>> >> Here is a proposal (a rough sketch of steps 1-3 follows below):
>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting the
>>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>>> >> notebook from the Zeppelin Spark interpreter, point all the SparkIMain
>>>>> >> instances at the same virtual directory.
>>>>> >> 2. Start an HTTP server on that virtual directory and register this
>>>>> >> HTTP server in the SparkContext using the classServerUri method.
>>>>> >> 3. Scala-generated code has a notion of packages. The default package
>>>>> >> name is "$line<linenumber>". The package name can be controlled using
>>>>> >> the system property scala.repl.name.line. Setting this property to the
>>>>> >> notebook id ensures that code generated by individual instances of
>>>>> >> SparkIMain is isolated from other instances of SparkIMain.
>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>> >> execution at a time per notebook.
>>>>> >>
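>>>>> >> A rough sketch of how steps 1-3 might look from the interpreter side.
>>>>> >> Step 3 uses only existing APIs; steps 1-2 depend on the proposed Spark
>>>>> >> change, so the virtual-directory constructor and classServer below are
>>>>> >> hypothetical placeholders, not existing Spark API:
>>>>> >>
>>>>> >>     // noteId would come from InterpreterContext (per the discussion above).
>>>>> >>     val noteId: String = context.getNoteId
>>>>> >>
>>>>> >>     // Step 3: a unique generated-name prefix per notebook, so classes
>>>>> >>     // produced by different SparkIMain instances cannot collide on the
>>>>> >>     // executors (scala.repl.name.line is an existing Scala REPL property).
>>>>> >>     System.setProperty("scala.repl.name.line", "$line_" + noteId)
>>>>> >>
>>>>> >>     // Step 1 (hypothetical, pending the proposed SparkIMain change):
>>>>> >>     // val imain = new SparkIMain(settings, out, sharedVirtualDir)
>>>>> >>
>>>>> >>     // Step 2 (hypothetical): serve sharedVirtualDir over HTTP once and
>>>>> >>     // point the SparkContext's class loading at it, e.g.
>>>>> >>     // conf.set("spark.repl.class.uri", classServer.uri)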
>>>>> >> I have tested 1, 2, and 3, and it seems to provide isolation across
>>>>> >> class names. I'll work towards submitting a formal patch soon - is
>>>>> >> there already a JIRA for this that I can take up? Also I need to
>>>>> >> understand:
>>>>> >> 1. How does Zeppelin pick up Spark fixes? Or do I need to first work
>>>>> >> towards getting the Spark changes merged into Apache Spark on GitHub?
>>>>> >>
>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
>>>>> >>
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> A separate instance of SparkILoop/SparkIMain for each notebook while
>>>>> >>> sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>>> >>> can generate the same class name, and the Spark executors confuse the
>>>>> >>> class names since they're reading classes from a single SparkContext.
>>>>> >>>
>>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform)
>>>>> >>> <piyush.muk...@flipkart.com> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on this? We have to wait a lot when multiple people
>>>>> >>>    are working with Spark.
>>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>>> >>>    print streams for each notebook while sharing the SparkContext,
>>>>> >>>    ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>> >>>    the parallel scheduler?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking a dedicated SparkContext per notebook in
>>>>> >>>    Spark's remote interpreter - this will allow multiple users to run
>>>>> >>>    their Spark paragraphs in parallel. Also, within a notebook only
>>>>> >>>    one paragraph is executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking the question.
>>>>> >>>>
>>>>> >>>> The reason is simply that it is running code statements. The
>>>>> >>>> statements can have order and dependency. Imagine I have two
>>>>> >>>> paragraphs:
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not run one by one, they may run in random order and the
>>>>> >>>> output will not always be the same: either '1' or 'val a cannot be
>>>>> >>>> found'.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there is a nice idea for handling this
>>>>> >>>> problem, I agree using the parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>> >>>> <linxizeng0...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>>    Is there anyone who has the same question as me? Or is this not
>>>>> >>>>    a question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in
>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>>> >>>>        the Spark interpreter run Spark jobs one by one. It's not a
>>>>> >>>>        good experience when a couple of users do some work on
>>>>> >>>>        Zeppelin at the same time, because they have to wait for each
>>>>> >>>>        other. At the same time, SparkSqlInterpreter can choose which
>>>>> >>>>        scheduler to use via "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is, what kind of consideration did you base this
>>>>> >>>>        decision on?
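>>>>> >>>>
>>>>> >>>>        For reference, a sketch (in Scala for brevity) of how that
>>>>> >>>>        choice could look, assuming the createOrGetFIFOScheduler /
>>>>> >>>>        createOrGetParallelScheduler methods of Zeppelin's
>>>>> >>>>        SchedulerFactory; the scheduler names and max concurrency
>>>>> >>>>        are illustrative:
>>>>> >>>>
>>>>> >>>>            import org.apache.zeppelin.scheduler.{Scheduler, SchedulerFactory}
>>>>> >>>>
>>>>> >>>>            def schedulerFor(concurrentSQL: Boolean, id: String): Scheduler =
>>>>> >>>>              if (concurrentSQL)
>>>>> >>>>                // paragraphs may run in parallel (zeppelin.spark.concurrentSQL=true)
>>>>> >>>>                SchedulerFactory.singleton().createOrGetParallelScheduler("sql_" + id, 10)
>>>>> >>>>              else
>>>>> >>>>                // strict one-at-a-time execution
>>>>> >>>>                SchedulerFactory.singleton().createOrGetFIFOScheduler("spark_" + id)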
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>>
>>>>
>>>>
>>
