Re: Change for submitting to yarn in 1.3.1
Marcelo,

Thanks for the comments. All my requirements come from our work over the last year in yarn-cluster mode, so I am biased toward the YARN side. It's true that some of these tasks might be accomplished with a separate YARN API call, but the API just doesn't seem to be of that nature if we go that way. I had a great discussion (face to face) at Databricks today with Andrew Or about how to address these requirements.

For #3, Andrew points out that the recent dynamic resource allocation feature makes this requirement less important. Once dynamic resource allocation is enabled, the user doesn't need to specify the number of executors or the memory up front as before. In Spark 1.x the user needs to specify these numbers, and on a small cluster such jobs get killed immediately if the memory specified is larger than the YARN maximum. We were also hoping to determine the executors and memory needed dynamically based on the data size, while making sure they do not exceed the maximum. With dynamic resource allocation, I think we can just let Spark handle this dynamically.

For #4, the SparkContext status tracker can give this information, but you need to poll it at some time interval. Some kind of event-based callback would be nice.

For #5, yes, it's about the command line args. These args are the input for the Spark jobs. It seems a bit too much to create a file just to specify Spark job args, and these args could be a few thousand columns in machine learning jobs.

For #6, we were thinking that our need for communication is not special to us; other applications may need this as well. But this may require too many changes in Spark. In our case, we did the following:

1) We modified the YARN client to expose a YARN app listener, so it calls back on events based on the Spark YARN report interval (default 1 sec). This gives us the container start, app in progress, failed, and killed events.

2) In our own Spark job, we wrap the main method with an Akka actor which communicates with the actor in the application job submitter. A logger and a Spark job listener are created. The Spark job listener sends messages to the logger, and the logger relays the messages to the application via the Akka actor. Stdout and stderr are redirected to the logger as well. Depending on the type of message, the application will update the UI (which shows the progress bar), log the message directly to the log file, or update the job state.

We are using log4j; the issue is that in yarn-cluster mode the logs are inside the cluster, not in the application, which is outside the cluster. We want to capture the cluster logs and error messages directly in the application log.

I will put a design doc and actual code in my pull request later, as Andrew requested. This PR is unlikely to get merged in, but it will show the idea I am talking about here.

Thanks for listening and responding,
Chester

Sent from my iPad

On May 14, 2015, at 18:41, Marcelo Vanzin wrote:

> Hi Chester,
>
> Thanks for the feedback. A few of those are great candidates for improvements
> to the launcher library.
>
> On Wed, May 13, 2015 at 5:44 AM, Chester At Work wrote:
> 1) client should not be private ( unless alternative is provided) so we
> can call it directly.
>
> Patrick already touched on this subject, but I believe Client should be kept
> private. If we want to expose functionality for code launching Spark apps,
> Spark should provide an interface for that so that other cluster managers can
> benefit. It also keeps the API more consistent (everybody uses the same API
> regardless of what's the underlying cluster manager).
>
> 2) we need a way to stop the running yarn app programmatically ( the PR
> is already submitted)
>
> My first reaction to this was "with the app id, you can talk to YARN directly
> and do that". But given what I wrote above, I guess it would make sense for
> something like this to be exposed through the library too.
>
> 3) before we start the spark job, we should have a call back to the
> application, which will provide the yarn container capacity (number of cores
> and max memory ), so spark program will not set values beyond max values (PR
> submitted)
>
> I'm not sure exactly what you mean here, but it feels like we're starting to
> get into "wrapping the YARN API" territory. Someone who really cares about
> that information can easily fetch it from YARN, the same way Spark would.
>
> 4) call back could be in form of yarn app listeners, which call back
> based on yarn status changes ( start, in progress, failure, complete etc),
> application can react based on these events in PR)
>
> Exposing some sort of status for the running application does sound u
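The event-based callbacks Chester asks for in #4 (instead of polling the status tracker at an interval) amount to a small pull-to-push adapter. The sketch below is purely illustrative: the `AppEvent` types, the `AppListener` trait, and the polling loop are hypothetical names of mine, not Spark's actual API.

```scala
// Hypothetical sketch: wrap a pull-style status source (like a status
// tracker that must be polled) so the application receives push-style
// event callbacks, as Chester describes wanting.

sealed trait AppEvent
case object Started extends AppEvent
case class InProgress(completedTasks: Int, totalTasks: Int) extends AppEvent
case class Finished(succeeded: Boolean) extends AppEvent

trait AppListener {
  def onEvent(event: AppEvent): Unit
}

object ListenerDemo {
  // Poll `status` every `intervalMs` and forward only *changes* to the listener.
  def run(status: () => AppEvent, listener: AppListener, intervalMs: Long = 1000): Unit = {
    var last: Option[AppEvent] = None
    var done = false
    while (!done) {
      val current = status()
      if (!last.contains(current)) {   // suppress duplicate states
        listener.onEvent(current)
        last = Some(current)
      }
      current match {
        case Finished(_) => done = true
        case _           => Thread.sleep(intervalMs)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Fake status source standing in for a real YARN/Spark status report.
    val states = Iterator[AppEvent](Started, InProgress(1, 3), InProgress(3, 3), Finished(true))
    val seen = scala.collection.mutable.Buffer[AppEvent]()
    run(() => states.next(), e => seen += e, intervalMs = 1)
    println(seen.mkString(", "))
  }
}
```

The application code then only implements `AppListener` and never touches the polling interval, which is the shape of API the thread is asking the launcher library to expose.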
Re: Change for submitting to yarn in 1.3.1
Patrick,

There are several things we need; some of them were already mentioned on the mailing list before. I haven't looked at the SparkLauncher code, but here are a few things we need from our perspective for the Spark YARN client:

1) Client should not be private (unless an alternative is provided) so we can call it directly.

2) We need a way to stop the running YARN app programmatically (the PR is already submitted).

3) Before we start the Spark job, we should have a callback to the application which provides the YARN container capacity (number of cores and max memory), so the Spark program will not set values beyond the max values (PR submitted).

4) The callback could be in the form of YARN app listeners, which call back based on YARN status changes (start, in progress, failure, complete, etc.); the application can react based on these events (in PR).

5) The YARN client passes arguments to the Spark program through the main method, and we have experienced problems when passing a very large argument due to the length limit. For example, we use JSON to serialize the arguments and encode them, then parse them as an argument; for wide-column datasets we run into the limit. Therefore an alternative way of passing larger arguments is needed. We are experimenting with passing the args via an established Akka messaging channel.

6) The Spark YARN client in yarn-cluster mode is right now essentially a batch job with no communication once it is launched. We need to establish a communication channel so that logs, errors, status updates, progress bars, execution stages, etc. can be displayed on the application side. We added an Akka communication channel for this (working on a PR). Combined with the other items in this list, we are able to redirect print and error statements to the application log (outside of the Hadoop cluster) and show a Spark UI equivalent progress bar via a Spark listener. We can show YARN progress via the YARN app listener before Spark has started, and status can be updated during job execution. We are also experimenting with long-running jobs with additional Spark commands and interactions via this channel.

Chester

Sent from my iPad

On May 12, 2015, at 20:54, Patrick Wendell wrote:

> Hey Kevin and Ron,
>
> So is the main shortcoming of the launcher library the inability to
> get an app ID back from YARN? Or are there other issues here that
> fundamentally regress things for you.
>
> It seems like adding a way to get back the appID would be a reasonable
> addition to the launcher.
>
> - Patrick
>
> On Tue, May 12, 2015 at 12:51 PM, Marcelo Vanzin wrote:
>> On Tue, May 12, 2015 at 11:34 AM, Kevin Markey wrote:
>>
>>> I understand that SparkLauncher was supposed to address these issues, but
>>> it really doesn't. Yarn already provides indirection and an arm's length
>>> transaction for starting Spark on a cluster. The launcher introduces yet
>>> another layer of indirection and dissociates the Yarn Client from the
>>> application that launches it.
>>
>> Well, not fully. The launcher was supposed to solve "how to launch a Spark
>> app programmatically", but in the first version nothing was added to
>> actually gather information about the running app. It's also limited in the
>> way it works because of Spark's limitations (one context per JVM, etc).
>>
>> Still, adding things like this is something that is definitely in the scope
>> for the launcher library; information such as the app id can be useful for the
>> code launching the app, not just in yarn mode. We just have to find a clean
>> way to provide that information to the caller.
>>
>>> I am still reading the newest code, and we are still researching options
>>> to move forward. If there are alternatives, we'd like to know.
>>
>> Super hacky, but if you launch Spark as a child process you could parse the
>> stderr and get the app ID.
>>
>> --
>> Marcelo

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
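Marcelo's "super hacky" fallback, launching Spark as a child process and scraping stderr for the app ID, needs little more than a regex over the launcher output. The `application_<clusterTimestamp>_<sequence>` shape is YARN's standard application ID format; the object name and the sample log lines below are illustrative, not taken from any real run.

```scala
import scala.util.matching.Regex

object AppIdFromStderr {
  // YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
  private val AppId: Regex = """application_\d+_\d+""".r

  // Scan launcher stderr lines lazily and return the first application ID seen.
  def find(stderrLines: Iterator[String]): Option[String] =
    stderrLines.map(AppId.findFirstIn(_)).collectFirst { case Some(id) => id }

  def main(args: Array[String]): Unit = {
    // Illustrative stand-in for the child process's stderr stream.
    val sample = Iterator(
      "15/05/12 20:54:01 INFO yarn.Client: Submitting application to ResourceManager",
      "15/05/12 20:54:02 INFO impl.YarnClientImpl: Submitted application application_1431480000000_0042"
    )
    println(find(sample))  // prints Some(application_1431480000000_0042)
  }
}
```

In practice the iterator would come from `scala.io.Source.fromInputStream(process.getErrorStream).getLines()`, which is exactly why the thread calls this hacky: it couples the caller to the client's log format rather than a stable API.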
Re: Using CUDA within Spark / boosting linear algebra
Reynold,

Prof. Canny gave me the slides yesterday. I will post the link to the slides to both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin wrote:

> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
>
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny wrote:
>
>> If you're contemplating GPU acceleration in Spark, it's important to look
>> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
>> datasets we've tested in BIDMach, and we've tried to make them
>> representative of industry machine learning workloads. Unless you're
>> crunching images or audio, the majority of data will be very sparse and
>> power-law distributed. You need a good sparse BLAS, and in practice it seems
>> like you need a sparse BLAS tailored for power-law data. We had to write our
>> own since the NVIDIA libraries didn't perform well on typical power-law data.
>> Intel MKL sparse BLAS also have issues, and we only use some of them.
>>
>> You also need 2D reductions, scan operations, slicing, element-wise
>> transcendental functions and operators, many kinds of sort, random number
>> generators, etc., and some kind of memory management strategy. Some of this
>> was layered on top of Thrust in BIDMat, but most had to be written from
>> scratch. It's all been rooflined, typically to the memory throughput of
>> current GPUs (around 200 GB/s).
>>
>> When you have all this you can write learning algorithms in the same
>> high-level primitives available in Breeze or Numpy/Scipy. It's literally the
>> same in BIDMat, since the generic matrix operations are implemented on both
>> CPU and GPU, so the same code runs on either platform.
>>
>> A lesser known fact is that GPUs are around 10x faster for *all* those
>> operations, not just dense BLAS. It's mostly due to faster streaming memory
>> speeds, but some kernels (random number generation and transcendentals) are
>> more than an order of magnitude faster thanks to some specialized hardware
>> for power series on the GPU chip.
>>
>> When you have all this there is no need to move data back and forth across
>> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
>> and feed them to the available GPUs. Most models fit comfortably in GPU
>> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
>> data through the GPU this way.
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
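The roofline reasoning in Canny's post reduces to one line of arithmetic: a bandwidth-bound kernel's attainable throughput is its arithmetic intensity (FLOPs per byte moved) times streaming memory bandwidth, capped at the chip's peak compute rate. The 200 GB/s bandwidth figure comes from the post above; the sparse matrix-vector multiply intensity and the peak number in the sketch are illustrative assumptions of mine.

```scala
object RooflineSketch {
  // Roofline model: attainable GFLOP/s = min(intensity * bandwidth, peak compute).
  // A kernel whose intensity * bandwidth is below peak is memory-bound.
  def attainableGflops(intensityFlopsPerByte: Double,
                       bandwidthGBps: Double,
                       peakGflops: Double): Double =
    math.min(intensityFlopsPerByte * bandwidthGBps, peakGflops)

  def main(args: Array[String]): Unit = {
    // Sparse matrix-vector multiply moves roughly a value plus an index per
    // nonzero for ~2 FLOPs, giving an assumed intensity of ~0.2 FLOPs/byte.
    val spmv = attainableGflops(
      intensityFlopsPerByte = 0.2,
      bandwidthGBps = 200.0,   // GPU streaming bandwidth quoted in the post
      peakGflops = 4000.0      // illustrative peak compute rate
    )
    // The ceiling lands far below peak, which is why the post stresses
    // memory throughput over raw FLOPs for sparse workloads.
    println(f"bandwidth-bound SpMV ceiling: $spmv%.0f GFLOP/s")
  }
}
```

This is why the post's claim that GPUs are ~10x faster for sparse operations tracks the ratio of streaming memory speeds, not the much larger ratio of peak dense FLOPs.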
Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)
gen-idea should work. I use it all the time. But use the approach that works for you.

Sent from my iPad

On Nov 18, 2014, at 11:12 PM, "Yiming (John) Zhang" wrote:

> Hi Chester, thank you for your reply. But I tried this approach and it
> failed. It seems that there is more difficulty using sbt in IntelliJ than
> expected.
>
> And according to some references "# sbt/sbt gen-idea" is not necessary
> (after Spark-1.0.0?); you can simply import the Spark project and IntelliJ
> will automatically generate the dependencies (but as described here, with
> some possible mistakes that may fail the compilation).
>
> Cheers,
> Yiming
>
> -Original Message-
> From: Chester @work [mailto:ches...@alpinenow.com]
> Sent: November 19, 2014, 13:00
> To: Chen He
> Cc: sdi...@gmail.com; dev@spark.apache.org
> Subject: Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for
> beginners)
>
> For sbt you can simply run
>
> sbt/sbt gen-idea
>
> to generate the IntelliJ IDEA project module for you. You can then just open
> the generated project, which includes all the needed dependencies.
>
> Sent from my iPhone
>
>> On Nov 18, 2014, at 8:26 PM, Chen He wrote:
>>
>> Thank you Yiming. It is helpful.
>>
>> Regards!
>>
>> Chen
>>
>> On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang wrote:
>>
>>> Hi,
>>>
>>> I noticed it is hard to find a thorough introduction to using
>>> IntelliJ to debug SPARK-1.1 apps with mvn/sbt, which is not
>>> straightforward for beginners. So I spent several days to figure it
>>> out and hope that it will be helpful for beginners like me and that
>>> professionals can help me improve it. (The intro with figures can be found at:
>>> http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)
>>>
>>> (1) Install the Scala plugin
>>>
>>> (2) Download, unzip, and open spark-1.1.0 in IntelliJ
>>>
>>> a) mvn: File -> Open.
>>> Select the Spark source folder (e.g., /root/spark-1.1.0). It may
>>> take a long time to download and compile a lot of things.
>>>
>>> b) sbt: File -> Import Project.
>>> Select "Import project from external model", then choose SBT
>>> project and click Next. Input the Spark source path (e.g.,
>>> /root/spark-1.1.0) for "SBT project", and select "Use auto-import".
>>>
>>> (3) First compile and run the Spark examples in the console to ensure
>>> everything is OK:
>>>
>>> # mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
>>> # ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
>>>
>>> (4) Add the compiled spark-hadoop library
>>> (spark-assembly-1.1.0-hadoop2.2.0) to "Libraries" (File -> Project
>>> Structure -> Libraries -> green +), and choose the modules that use it
>>> (right-click the library and click "Add to Modules"). It seems only
>>> spark-examples needs it.
>>>
>>> (5) In the "Dependencies" page of the modules using this library,
>>> ensure that the "Scope" of this library is "Compile" (File -> Project
>>> Structure -> Modules).
>>>
>>> (6) For sbt, it seems that we have to label the scope of all other
>>> hadoop dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due
>>> to a poor Internet connection?). And this has to be done every time
>>> IntelliJ is opened (due to a bug?).
>>>
>>> (7) Configure the debug environment (using LogQuery as an example): Run
>>> -> Edit Configurations.
>>>
>>> Main class: org.apache.spark.examples.LogQuery
>>> VM options: -Dspark.master=local
>>> Working directory: /root/spark-1.1.0
>>> Use classpath of module: spark-examples_2.10
>>>
>>> Before launch: External tool: mvn
>>> Program: /root/Programs/apache-maven-3.2.1/bin/mvn
>>> Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests package
>>> Working directory: /root/spark-1.1.0
>>>
>>> Before launch: External tool: sbt
>>> Program: /root/spark-1.1.0/sbt/sbt
>>> Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly
>>> Working directory: /root/spark-1.1.0
>>>
>>> (8) Click Run -> Debug 'LogQuery' to start debugging
>>>
>>> Cheers,
>>> Yiming
Re: Random forest - is it under implementation?
Sung Chung from Alpine Data Labs presented the random forest implementation at Spark Summit 2014. The work will be open sourced and contributed back to MLlib. Stay tuned.

Sent from my iPad

On Jul 11, 2014, at 6:02 AM, Egor Pahomov wrote:

> Hi, I have an intern who wants to implement some ML algorithm for Spark.
> Which algorithm would be a good idea to implement (it should not be very
> difficult)? I heard someone is already working on random forest, but I
> couldn't find proof of that.
>
> I'm aware of the new policy, where we should implement stable, good quality,
> popular ML or not do it at all.
>
> --
> Sincerely yours,
> Egor Pakhomov
> Scala Developer, Yandex