Any thoughts on how to package the changes related to Spark?

On 17-Aug-2015 7:58 pm, "moon soo Lee" <m...@apache.org> wrote:

I think releasing SparkIMain and related objects after a configurable period of inactivity would be good for now.

About the scheduler, I can help implement such a scheduler.

Thanks,
moon

On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <praag...@gmail.com> wrote:

Hi Moon,

Yes, the notebook id comes from InterpreterContext. At the moment, destroying SparkIMain on deletion of a notebook is not handled. I think SparkIMain is a lightweight object; do you see a concern with keeping these objects in a map? One possible option could be to destroy notebook-related objects once a notebook has been inactive for more than, say, 8 hours.

> 4. Build a queue inside the interpreter to allow only one paragraph
> execution at a time per notebook.
>
> One downside of this approach is, GUI will display RUNNING instead of
> PENDING for jobs inside of queue in interpreter.

Yes, that's a good point. Having a scheduler at the Zeppelin server that is parallel across notebooks and FIFO across paragraphs would be nice. Is there any plan for such a scheduler?

Regards,
-Pranav.
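A minimal sketch of the per-notebook bookkeeping discussed here, assuming one session object holding a SparkIMain per notebook id. NoteSession, its close() hook, and the reaper wiring are illustrative, not existing Zeppelin APIs; only the map-plus-inactivity-timeout idea comes from the thread:

import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import scala.collection.JavaConverters._

// Hypothetical holder for one notebook's SparkIMain and related objects.
class NoteSession(val noteId: String) {
  @volatile var lastUsed: Long = System.currentTimeMillis()
  def touch(): Unit = { lastUsed = System.currentTimeMillis() }
  def close(): Unit = { /* tear down SparkIMain, print streams, ... */ }
}

object NoteSessions {
  private val sessions = new ConcurrentHashMap[String, NoteSession]()
  private val maxIdleMs = 8L * 60 * 60 * 1000  // "say 8 hours", configurable

  def get(noteId: String): NoteSession = {
    var s = sessions.get(noteId)
    if (s == null) {
      val fresh = new NoteSession(noteId)
      s = sessions.putIfAbsent(noteId, fresh)
      if (s == null) s = fresh
    }
    s.touch()
    s
  }

  // Periodically evict sessions that have been idle longer than maxIdleMs.
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
    def run(): Unit = {
      val now = System.currentTimeMillis()
      for (s <- sessions.values.asScala if now - s.lastUsed > maxIdleMs) {
        sessions.remove(s.noteId, s)
        s.close()
      }
    }
  }, 1, 1, TimeUnit.HOURS)
}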
On 17/08/15 5:38 am, moon soo Lee wrote:

Pranav, the proposal looks awesome!

I have a question and some feedback.

You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you need the notebook id. Did you get it from InterpreterContext? And how did you handle destroying SparkIMain (when a notebook is deleted)? As far as I know, the interpreter has no way of being notified of notebook deletion.

> 4. Build a queue inside the interpreter to allow only one paragraph
> execution at a time per notebook.

One downside of this approach is that the GUI will display RUNNING instead of PENDING for jobs queued inside the interpreter.

Best,
moon

On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi....@gmail.com> wrote:

+1 for "to re-factor the Zeppelin architecture so that it can handle multi-tenancy easily"

On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduy...@gmail.com> wrote:

Agree with Joel, we may think about re-factoring the Zeppelin architecture so that it can handle multi-tenancy easily. The technical solution proposed by Pranav is great, but it only applies to Spark. Right now, each interpreter has to manage multi-tenancy in its own way. Ultimately, Zeppelin could propose a multi-tenancy contract (a UserContext, similar to InterpreterContext) that each interpreter can choose to use or not.
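To make that suggestion concrete, such a contract might look like the sketch below. UserContext, its fields, and MultiTenantInterpreter are hypothetical names; nothing like them exists in Zeppelin today:

// Hypothetical multi-tenancy contract, analogous to InterpreterContext.
case class UserContext(
  userId: String,               // login/identity of the user running the paragraph
  noteId: String,               // notebook the paragraph belongs to
  resourcePool: Option[String]  // e.g. a fair-scheduler pool or quota label
)

trait MultiTenantInterpreter {
  // Interpreters that understand multi-tenancy can honor the context;
  // others can simply ignore it.
  def interpret(code: String, context: UserContext): Unit
}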
On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djo...@gmail.com> wrote:

I think that while the idea of running multiple notes simultaneously is great, it is really dancing around the lack of true multi-user support in Zeppelin. The proposed solution would work if the application's resources are those of the whole cluster; but if the app is limited (say it has 8 cores out of 16, with some similar split of memory), then one note can potentially hog all the resources, and the scheduler will have to throttle all other executions, leaving you exactly where you are now.

While I think the solution is a good one, maybe this question should make us think about adding true multi-user support, where we isolate resources (the cluster and the notebooks themselves), have separate login/identity and (I don't know if it's possible) share the same context.

Thanks,
Joel

On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindpri...@gmail.com> wrote:

If the problem is that multiple users have to wait for each other while using Zeppelin, the solution already exists: they can create a new interpreter on the interpreter page and attach it to their notebook; then they don't have to wait for others to submit their jobs.

But I agree, having paragraphs from one note wait for paragraphs from other notes is a confusing default. We can get around that in two ways:

1. Create a new interpreter for each note and attach that interpreter to that note. This approach requires the fewest code changes, but it is resource-heavy and doesn't let you share the SparkContext between different notes.
2. If we want to share the SparkContext between different notes, we can submit jobs from different notes into different fair-scheduler pools (https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application). This can be done by submitting jobs from different notes in different threads, as in the sketch below. This makes sure that jobs from one note run sequentially, while jobs from different notes can run in parallel.

Neither of these options requires any change in the Spark code.

--
Thanks & Regards
Rohit Agarwal
https://www.linkedin.com/in/rohitagarwal003
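A minimal sketch of option 2, assuming the shared SparkContext is started with spark.scheduler.mode=FAIR. The "spark.scheduler.pool" local property is Spark's real mechanism from the page linked above; the per-note executor wiring around it is only illustrative:

import java.util.concurrent.{ExecutorService, Executors}
import scala.collection.concurrent.TrieMap
import org.apache.spark.SparkContext

// One single-threaded executor per note keeps that note's jobs sequential,
// while different notes run in parallel on the shared SparkContext.
object NoteJobRunner {
  private val executors = TrieMap.empty[String, ExecutorService]

  def submit(sc: SparkContext, noteId: String)(job: => Unit): Unit = {
    val exec = executors.getOrElseUpdate(noteId, Executors.newSingleThreadExecutor())
    exec.submit(new Runnable {
      def run(): Unit = {
        // Spark reads this thread-local property to route jobs submitted
        // from this thread into the named fair-scheduler pool.
        sc.setLocalProperty("spark.scheduler.pool", noteId)
        try job
        finally sc.setLocalProperty("spark.scheduler.pool", null)
      }
    })
  }
}

// Usage from a paragraph run, e.g.:
// NoteJobRunner.submit(sc, noteId) { sc.parallelize(1 to 1000).count() }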
On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <praag...@gmail.com> wrote:

> If someone can share about the idea of sharing a single SparkContext
> through multiple SparkILoops safely, it'll be really helpful.

Here is a proposal:

1. In the Spark code, change SparkIMain.scala to allow setting the virtual directory. When creating new instances of SparkIMain per notebook from the Zeppelin Spark interpreter, point all the SparkIMain instances at the same virtual directory.
2. Start an HTTP server on that virtual directory and register that HTTP server with the SparkContext using the classServerUri method.
3. Scala-generated code has a notion of packages. The default package name is "line$<linenumber>". The package name can be controlled with the system property scala.repl.name.line; setting this property to the notebook id ensures that the code generated by each SparkIMain instance is isolated from the other instances.
4. Build a queue inside the interpreter to allow only one paragraph execution at a time per notebook.

I have tested 1, 2, and 3, and it seems to provide isolation across class names (see the sketch below). I'll work towards submitting a formal patch soon. Is there already a Jira for this that I can pick up? Also, I need to understand: how does Zeppelin take up Spark fixes, or do I need to first get the Spark changes merged into Apache Spark on GitHub?

Any suggestions or comments on the proposal are highly welcome.

Regards,
-Pranav.
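Roughly, steps 1-3 amount to the sketch below. The shared virtual directory hook on SparkIMain is precisely the change the proposal asks of Spark; it does not exist in stock Spark 1.x, so that part is marked as the proposed hook rather than a working API:

import scala.reflect.io.VirtualDirectory
import scala.tools.nsc.Settings
import org.apache.spark.repl.SparkIMain

object PerNoteRepl {
  // Step 1: all notes compile into one shared virtual directory
  // (passing it into SparkIMain is the proposed change, not stock API).
  val sharedVirtualDir = new VirtualDirectory("(memory)", None)

  def createForNote(noteId: String): SparkIMain = {
    // Step 3: generated classes are packaged per notebook id instead of
    // the default "line$<n>", so two notes can never emit the same name.
    System.setProperty("scala.repl.name.line", noteId)

    val settings = new Settings()
    settings.usejavacp.value = true
    new SparkIMain(settings) // would also take sharedVirtualDir via the proposed hook
  }

  // Step 2 (done once, not per note): serve sharedVirtualDir over HTTP and
  // point the shared SparkContext at that server, so executors fetch every
  // note's classes from one place. In the Spark 1.x REPL this is the conf
  // key that classServerUri feeds: conf.set("spark.repl.class.uri", ...).
}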
On 10/08/15 11:36 pm, moon soo Lee wrote:

Hi Piyush,

A separate instance of SparkILoop/SparkIMain for each notebook, while sharing the SparkContext, sounds great.

Actually, I tried to do that and found a problem: multiple SparkILoops can generate the same class name, and the Spark executors then confuse the class names, since they read classes from a single SparkContext.

If someone can share an idea for safely sharing a single SparkContext across multiple SparkILoops, it'll be really helpful.

Thanks,
moon

On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <piyush.muk...@flipkart.com> wrote:

Hi Moon,
Any suggestion on this? We have to wait a lot when multiple people are working with Spark.
Can we create a separate instance of SparkILoop, SparkIMain and print streams for each notebook, while sharing the SparkContext, ZeppelinContext, SQLContext and DependencyResolver, and then use the parallel scheduler?
Thanks

-piyush

Hi Moon,

How about tracking a dedicated SparkContext for each notebook in Spark's remote interpreter? This would allow multiple users to run their Spark paragraphs in parallel. Also, within a notebook, only one paragraph would be executed at a time.

Regards,
-Pranav.

On 15/07/15 7:15 pm, moon soo Lee wrote:

Hi,

Thanks for asking the question.

The reason is simply that it is running code statements, and statements can have order and dependencies. Imagine I have two paragraphs:

%spark
val a = 1

%spark
print(a)

If they're not run one by one, they could run in random order, and the output would differ from run to run: either '1' or 'val a cannot be found'. This is the reason why. But if there is a good way to handle this problem, I agree that using a parallel scheduler would help a lot.

Thanks,
moon

On Tue, Jul 14, 2015 at 7:59 PM linxi zeng <linxizeng0...@gmail.com> wrote:

Anyone who has the same question as me? Or is this not a question?

2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0...@gmail.com>:

hi, Moon:
I notice that the getScheduler function in SparkInterpreter.java returns a FIFOScheduler, which makes the Spark interpreter run Spark jobs one by one. That's not a good experience when a couple of users work on Zeppelin at the same time, because they have to wait for each other. At the same time, SparkSqlInterpreter can choose which scheduler to use via "zeppelin.spark.concurrentSQL". My question is: what considerations is this decision based on?
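For reference, the scheduler choice linxi zeng describes boils down to something like the following, transcribed to Scala for consistency with the sketches above (the real code in SparkInterpreter.java and SparkSqlInterpreter.java is Java, and the scheduler name and the max-concurrency value of 10 here are illustrative):

import org.apache.zeppelin.scheduler.{Scheduler, SchedulerFactory}

// SparkInterpreter always takes the FIFO branch; SparkSqlInterpreter
// returns a parallel scheduler when zeppelin.spark.concurrentSQL is on.
def getScheduler(schedulerName: String, concurrentSQL: Boolean): Scheduler =
  if (concurrentSQL)
    SchedulerFactory.singleton().createOrGetParallelScheduler(schedulerName, 10)
  else
    SchedulerFactory.singleton().createOrGetFIFOScheduler(schedulerName)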