> If someone can share ideas about safely sharing a single SparkContext across multiple SparkILoop instances, it would be really helpful.
Here is a proposal:
1. In the Spark code, change SparkIMain.scala to allow setting the virtual directory. When the Zeppelin Spark interpreter creates a new SparkIMain instance per notebook, point all the instances at the same virtual directory.
2. Start an HTTP server on that virtual directory and register it with the SparkContext via the classServerUri mechanism.
3. Scala-generated code has a notion of packages; the default package name is "line$<linenumber>". The package name can be controlled with the system property scala.repl.name.line. Setting this property to the notebook id ensures that code generated by one SparkIMain instance is isolated from the code generated by the other instances.
4. Build a queue inside the interpreter so that only one paragraph executes at a time per notebook.
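To make steps 1-3 concrete, here is a minimal sketch. It assumes a patched SparkIMain that exposes a setVirtualDirectory method (stock Spark has no such setter; adding one is exactly the step-1 change), and it leaves the step-2 HTTP class-server wiring as a comment, since Spark's HttpServer is private[spark]:

```scala
import java.io.PrintWriter
import scala.tools.nsc.Settings
import scala.tools.nsc.io.VirtualDirectory
import org.apache.spark.repl.SparkIMain

// Step 1: one shared in-memory directory that receives the classes
// generated by every notebook's REPL instance.
val sharedClassDir = new VirtualDirectory("(zeppelin-classes)", None)

// Hypothetical per-notebook factory used by the Zeppelin Spark interpreter.
def createNotebookIMain(noteId: String, out: PrintWriter): SparkIMain = {
  // Step 3: bake the notebook id into generated package names so two
  // notebooks' "line 1" wrappers cannot collide. (Per the proposal; the
  // REPL reads this property when naming generated code.)
  System.setProperty("scala.repl.name.line", noteId + "line")

  val settings = new Settings
  settings.usejavacp.value = true

  val imain = new SparkIMain(settings, out)
  imain.setVirtualDirectory(sharedClassDir) // assumed setter from the step-1 patch
  imain
}

// Step 2 (not shown): serve sharedClassDir from a single HTTP server and hand
// its URI to the one SparkContext (the classServerUri / spark.repl.class.uri
// plumbing), so executors can fetch classes from any notebook's interpreter.
```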

I have tested 1, 2, and 3, and it seems to provide isolation across class names. I'll work towards submitting a formal patch soon. Is there already a JIRA for this that I can pick up? I also need to understand how Zeppelin picks up Spark fixes: do I need to first get the Spark changes merged into Apache Spark on GitHub?

Any suggestions or comments on the proposal are highly welcome.

Regards,
-Pranav.

On 10/08/15 11:36 pm, moon soo Lee wrote:
Hi Piyush,

Separate instances of SparkILoop and SparkIMain for each notebook, while sharing the SparkContext, sounds great.

Actually, I tried to do it and found a problem: multiple SparkILoop instances can generate the same class name, and the Spark executors confuse the classes, since they read classes from a single SparkContext.
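To make the collision concrete, here is a small illustration using the plain Scala IMain as a stand-in for SparkIMain (the wrapper names follow the Scala 2.10-era REPL convention; exact mangling varies by version):

```scala
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.IMain

def freshRepl(): IMain = {
  val settings = new Settings
  settings.usejavacp.value = true   // compile against the JVM classpath
  new IMain(settings)               // stand-in for one notebook's SparkIMain
}

val replA = freshRepl()             // notebook A
val replB = freshRepl()             // notebook B

// Each instance numbers its wrappers independently, so both first statements
// compile to identically named classes (roughly $line1.$read$$iw$$iw):
replA.interpret("val x = 1")
replB.interpret("val x = 2")

// With one SparkContext serving classes by name to the executors, A's $line1
// and B's $line1 clash; whichever class is fetched wins.
```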

If someone can share ideas about safely sharing a single SparkContext across multiple SparkILoop instances, it would be really helpful.

Thanks,
moon


On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <piyush.muk...@flipkart.com> wrote:

    Hi Moon,
    Any suggestions on it? We have to wait a lot when multiple people are
    working with Spark.
    Can we create separate instances of SparkILoop, SparkIMain, and print
    streams for each notebook, while sharing the SparkContext, ZeppelinContext,
    SQLContext, and DependencyResolver, and then use the parallel scheduler?
    Thanks

    -piyush
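For reference on the parallel-scheduler part of Piyush's question: a Zeppelin interpreter picks its scheduler by overriding getScheduler(), and SparkSqlInterpreter already switches between FIFO and parallel based on zeppelin.spark.concurrentSQL (discussed further down the thread). A sketch of the same idea, in Scala syntax for brevity (the real interpreter classes are Java; the concurrency cap of 10 is an arbitrary example value):

```scala
import org.apache.zeppelin.scheduler.{Scheduler, SchedulerFactory}

// Inside a subclass of org.apache.zeppelin.interpreter.Interpreter:
override def getScheduler: Scheduler = {
  val name = getClass.getName + this.hashCode()
  if ("true".equals(getProperty("zeppelin.spark.concurrentSQL")))
    SchedulerFactory.singleton().createOrGetParallelScheduler(name, 10)
  else
    SchedulerFactory.singleton().createOrGetFIFOScheduler(name)
}
```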

    Hi Moon,

    How about tracking a dedicated SparkContext per notebook in Spark's
    remote interpreter? This would allow multiple users to run their Spark
    paragraphs in parallel. Also, within a notebook, only one paragraph is
    executed at a time.

    Regards,
    -Pranav.


    On 15/07/15 7:15 pm, moon soo Lee wrote:
    > Hi,
    >
    > Thanks for asking question.
    >
    > The reason is simply that it is running code statements. The
    > statements can have order and dependencies. Imagine I have two paragraphs:
    >
    > %spark
    > val a = 1
    >
    > %spark
    > print(a)
    >
    > If they're not run one by one, they may run in random order, and the
    > output will differ from run to run: either '1' or a
    > 'not found: value a' error.
    >
    > This is the reason why. But if there is a nice idea to handle this
    > problem, I agree that using a parallel scheduler would help a lot.
    >
    > Thanks,
    > moon
    > On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
    > <linxizeng0...@gmail.com> wrote:
    >
    >     Anyone who has the same question as me? Or is this not a question?
    >
    >     2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:
    >
    >         hi, Moon:
    >            I notice that the getScheduler function in
    >         SparkInterpreter.java returns a FIFOScheduler, which makes the
    >         Spark interpreter run Spark jobs one by one. It's not a good
    >         experience when a couple of users are working on Zeppelin at
    >         the same time, because they have to wait for each other.
    >         At the same time, SparkSqlInterpreter can choose which
    >         scheduler to use via "zeppelin.spark.concurrentSQL".
    >         My question is: what considerations was this decision
    >         based on?
    >
    >



    

