Re: Change for submitting to yarn in 1.3.1

2015-05-14 Thread Chester At Work
Marcelo,
     Thanks for the comments. All of my requirements come from our work over the
last year in yarn-cluster mode, so I am biased toward the YARN side.

     It's true that some of these tasks could be accomplished with separate YARN
API calls, but the API no longer feels natural if we go that route.

     I had a great discussion (face to face) at Databricks today with Andrew Or
about how to address these requirements.

     For #3, Andrew points out that the recent dynamic resource allocation
feature makes this requirement less important. Once dynamic resource allocation
is enabled, the user no longer needs to specify the number of executors or the
amount of memory up front. In Spark 1.x the user had to specify these numbers,
and on a small cluster the job is killed immediately if the requested memory
exceeds YARN's maximum container memory. We were also hoping to determine the
executors and memory dynamically based on the data size, while making sure they
do not exceed that maximum.
     With dynamic resource allocation, I think we can just let Spark handle
this dynamically.
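
     For anyone following along, here is a minimal sketch of the configuration
this relies on, assuming Spark 1.2+ on YARN with the external shuffle service
running on the node managers; the property values are only illustrative and
would be tuned per cluster.

import org.apache.spark.{SparkConf, SparkContext}

object DynamicAllocationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      // Let Spark grow and shrink the executor count instead of fixing it up front.
      .set("spark.dynamicAllocation.enabled", "true")
      // Required on YARN so shuffle data survives executor removal.
      .set("spark.shuffle.service.enabled", "true")
      // Illustrative bounds; per-executor memory is still capped by
      // YARN's maximum container size.
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).count())
    sc.stop()
  }
}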

     For #4, the SparkContext status tracker can provide this information, but
you have to pull it on some time interval. Some kind of event-based callback
would be nice.
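
     To show what I mean by pulling on an interval, here is a minimal polling
sketch against the public SparkContext.statusTracker API; the one-second
interval and the dummy job are just placeholders.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.{SparkConf, SparkContext}

object StatusPollingExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("status-polling-example"))
    val tracker = sc.statusTracker

    // Run some work asynchronously so the main thread is free to poll.
    val work = Future { sc.parallelize(1 to 1000000, 100).map(_ * 2).count() }

    // Pull-based progress: check every second until the job finishes.
    while (!work.isCompleted) {
      for {
        jobId     <- tracker.getActiveJobIds()
        jobInfo   <- tracker.getJobInfo(jobId)
        stageId   <- jobInfo.stageIds()
        stageInfo <- tracker.getStageInfo(stageId)
      } {
        println(s"job $jobId, stage $stageId (${stageInfo.name()}): " +
          s"${stageInfo.numCompletedTasks()}/${stageInfo.numTasks()} tasks done")
      }
      Thread.sleep(1000)
    }
    sc.stop()
  }
}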

     For #5, yes, it's about the command-line args. These args are the input to
the Spark job. It seems like too much to create a file just to pass job
arguments, and the arguments can cover a few thousand columns in machine
learning jobs.

     For #6, we were thinking that our need for a communication channel is not
unique to us; other applications may need it as well. But this may require too
many changes in Spark.

     In our case, we did the following:
  1) We modified the YARN client to expose a YARN app listener, which calls back
on events based on the Spark YARN report interval (default 1 second). This gives
us the container-start, app-in-progress, failed, and killed events (a sketch of
this polling-based listener follows this list).

  2) In our own Spark job, we wrap the main method with an Akka actor that
communicates with an actor in the application that submits the job. A logger and
a Spark job listener are created; the Spark job listener sends messages to the
logger, and the logger relays them to the application via the Akka actor. Stdout
and stderr are redirected to the logger as well. Depending on the type of
message, the application updates the UI (which shows the progress bar), writes
the message directly to the log file, or updates the job state. We are using
log4j; the issue is that in yarn-cluster mode the logs live inside the cluster,
not in the application, which runs outside the cluster. We want to capture the
cluster output and error messages directly in the application log (a sketch of
the listener relay also follows this list).
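
     To make item 1 concrete, here is a rough sketch of the idea (not our actual
patch), assuming the Hadoop YarnClient API; the listener trait and its method
names are purely illustrative, and the poll interval mirrors
spark.yarn.report.interval.

import org.apache.hadoop.yarn.api.records.{ApplicationId, ApplicationReport, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Hypothetical callback interface; names are illustrative only.
trait YarnAppListener {
  def onStateChange(state: YarnApplicationState, report: ApplicationReport): Unit
  def onFinished(report: ApplicationReport): Unit
}

class YarnAppMonitor(appId: ApplicationId, listener: YarnAppListener,
                     intervalMs: Long = 1000L) {
  private val yarnClient = YarnClient.createYarnClient()
  yarnClient.init(new YarnConfiguration())
  yarnClient.start()

  def run(): Unit = {
    var lastState: YarnApplicationState = null
    var done = false
    while (!done) {
      val report = yarnClient.getApplicationReport(appId)
      val state = report.getYarnApplicationState
      if (state != lastState) {
        // e.g. ACCEPTED, RUNNING, FAILED, KILLED
        listener.onStateChange(state, report)
        lastState = state
      }
      if (state == YarnApplicationState.FINISHED ||
          state == YarnApplicationState.FAILED ||
          state == YarnApplicationState.KILLED) {
        listener.onFinished(report)
        done = true
      } else {
        Thread.sleep(intervalMs) // mirrors the 1-second report interval
      }
    }
    yarnClient.stop()
  }
}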
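
     And for item 2, a rough sketch of the driver-side relay, assuming Spark's
SparkListener API and Akka; the message type, actor path, and registration are
placeholders for what the actual PR will contain.

import akka.actor.ActorSelection
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerStageCompleted}

// Placeholder message understood by the submitter-side actor.
case class ProgressMessage(text: String)

// Forwards driver-side scheduler events to an actor in the submitting
// application, which sits outside the cluster.
class RelaySparkListener(submitter: ActorSelection) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    submitter ! ProgressMessage(s"job ${jobStart.jobId} started")

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    submitter ! ProgressMessage(
      s"stage ${stage.stageInfo.stageId} (${stage.stageInfo.name}) completed")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    submitter ! ProgressMessage(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
}

     The wrapped main method would register it with something like
sc.addSparkListener(new RelaySparkListener(system.actorSelection(submitterPath))),
where submitterPath is whatever address the submitting application advertises.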
  
     I will put a design doc and the actual code in my pull request later, as
Andrew requested. The PR is unlikely to get merged, but it will show the idea I
am describing here.

 Thanks for listening and responding.

Chester

Sent from my iPad

On May 14, 2015, at 18:41, Marcelo Vanzin  wrote:

> Hi Chester,
> 
> Thanks for the feedback. A few of those are great candidates for improvements 
> to the launcher library.
> 
> On Wed, May 13, 2015 at 5:44 AM, Chester At Work  
> wrote:
>  1) Client should not be private (unless an alternative is provided) so we
> can call it directly.
> 
> Patrick already touched on this subject, but I believe Client should be kept 
> private. If we want to expose functionality for code launching Spark apps, 
> Spark should provide an interface for that so that other cluster managers can 
> benefit. It also keeps the API more consistent (everybody uses the same API 
> regardless of the underlying cluster manager).
>  
>  2) We need a way to stop a running YARN app programmatically (the PR is
> already submitted).
> 
> My first reaction to this was "with the app id, you can talk to YARN directly 
> and do that". But given what I wrote above, I guess it would make sense for 
> something like this to be exposed through the library too.
>  
>  3) Before we start the Spark job, we should have a callback to the
> application that provides the YARN container capacity (number of cores and max
> memory), so the Spark program will not set values beyond the maximums (PR
> submitted).
> 
> I'm not sure exactly what you mean here, but it feels like we're starting to 
> get into "wrapping the YARN API" territory. Someone who really cares about 
> that information can easily fetch it from YARN, the same way Spark would.
>  
>  4) Callbacks could be in the form of YARN app listeners, which fire on YARN
> status changes (start, in progress, failure, complete, etc.); the application
> can react to these events (in the PR).
> 
> Exposing some sort of status for the running application does sound useful.
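
For reference, a minimal sketch of fetching that capacity directly from YARN,
"the same way Spark would" (per Marcelo's note on #3 above), using the Hadoop
YarnClient API; everything here is illustrative.

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object MaxContainerCapacity {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Creating (not submitting) an application returns the cluster's
    // maximum allocatable container resources.
    val newApp = yarnClient.createApplication()
    val maxCap = newApp.getNewApplicationResponse.getMaximumResourceCapability

    println(s"max container memory (MB): ${maxCap.getMemory}")
    println(s"max container vcores:      ${maxCap.getVirtualCores}")

    yarnClient.stop()
  }
}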

Re: Change for submitting to yarn in 1.3.1

2015-05-13 Thread Chester At Work
Patrick,
     There are several things we need; some of them have already been mentioned
on the mailing list before.

I haven't looked at the SparkLauncher code, but here are a few things we need,
from our perspective, for the Spark YARN client:

  1) Client should not be private (unless an alternative is provided) so we can
call it directly.
  2) We need a way to stop a running YARN app programmatically (the PR is
already submitted).
  3) Before we start the Spark job, we should have a callback to the
application that provides the YARN container capacity (number of cores and max
memory), so the Spark program will not set values beyond the maximums (PR
submitted).
  4) Callbacks could be in the form of YARN app listeners, which fire on YARN
status changes (start, in progress, failure, complete, etc.); the application
can react to these events (in the PR).
 
  5) The YARN client passes arguments to the Spark program through the main
method, and we have experienced problems when passing very large arguments
because of the command-line length limit. For example, we serialize the
arguments as JSON, encode them, and then parse them back as arguments; for
datasets with wide columns we run into the limit. Therefore, an alternative way
of passing larger arguments is needed. We are experimenting with passing the
args over an established Akka messaging channel (see the sketch below).

6) The Spark YARN client in yarn-cluster mode is essentially a batch job with
no communication once it is launched. We need to establish a communication
channel so that logs, errors, status updates, progress bars, execution stages,
etc. can be displayed on the application side. We added an Akka communication
channel for this (working on a PR).

     Combined with the other items in this list, we are able to redirect print
and error statements to the application log (outside the Hadoop cluster) and to
show a Spark-UI-equivalent progress bar via a Spark listener. We can show YARN
progress via a YARN app listener before Spark starts, and the status can be
updated during job execution.

We are also experimenting with long-running jobs that accept additional Spark
commands and interactions over this channel.
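
     To give a flavor of #5 and #6 together, here is a simplified sketch of the
driver side of such a channel (not the PR code; the actor path, message
protocol, and timeout are placeholders): only a short actor address travels on
the command line, and the large JSON argument blob is pulled over Akka.

import scala.concurrent.Await
import scala.concurrent.duration._

import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout

// Placeholder protocol between the driver and the submitting application.
case object GetJobArgs
case class JobArgs(json: String)

object DriverSide {
  def main(args: Array[String]): Unit = {
    // The submitter passes only its actor address on the command line,
    // e.g. "akka.tcp://submitter@host:2552/user/jobArgs", which stays far
    // below any argument-length limit.
    val submitterPath = args(0)

    // Assumes this ActorSystem is configured with Akka remoting enabled.
    val system = ActorSystem("driver")
    implicit val timeout = Timeout(30.seconds)

    // Pull the (potentially very large) serialized arguments over the channel.
    val reply = Await.result(system.actorSelection(submitterPath) ? GetJobArgs,
      timeout.duration)
    val JobArgs(json) = reply

    println(s"received ${json.length} characters of job arguments")
    // ... parse json, then build the SparkContext and run the job ...

    system.shutdown()
  }
}

     The submitting application would run a small actor registered at that path
that replies with JobArgs, and the same channel carries the log and status
messages going the other way.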


 Chester

Sent from my iPad

On May 12, 2015, at 20:54, Patrick Wendell  wrote:

> Hey Kevin and Ron,
> 
> So is the main shortcoming of the launcher library the inability to
> get an app ID back from YARN? Or are there other issues here that
> fundamentally regress things for you?
> 
> It seems like adding a way to get back the appID would be a reasonable
> addition to the launcher.
> 
> - Patrick
> 
> On Tue, May 12, 2015 at 12:51 PM, Marcelo Vanzin  wrote:
>> On Tue, May 12, 2015 at 11:34 AM, Kevin Markey 
>> wrote:
>> 
>>> I understand that SparkLauncher was supposed to address these issues, but
>>> it really doesn't.  Yarn already provides indirection and an arm's length
>>> transaction for starting Spark on a cluster. The launcher introduces yet
>>> another layer of indirection and dissociates the Yarn Client from the
>>> application that launches it.
>>> 
>> 
>> Well, not fully. The launcher was supposed to solve "how to launch a Spark
>> app programmatically", but in the first version nothing was added to
>> actually gather information about the running app. It's also limited in the
>> way it works because of Spark's limitations (one context per JVM, etc).
>> 
>> Still, adding things like this is definitely within the scope of the
>> launcher library; information such as the app id can be useful for the
>> code launching the app, not just in yarn mode. We just have to find a clean
>> way to provide that information to the caller.
>> 
>> 
>>> I am still reading the newest code, and we are still researching options
>>> to move forward.  If there are alternatives, we'd like to know.
>>> 
>>> 
>> Super hacky, but if you launch Spark as a child process you could parse the
>> stderr and get the app ID.
>> 
>> --
>> Marcelo
> 
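
(For anyone who wants to try the "super hacky" route Marcelo describes, here is
a minimal sketch using the launcher library from Spark 1.3+; the jar path, main
class, and the assumption that the YARN client prints "Application report for
application_..." lines to stderr are all illustrative.)

import java.io.{BufferedReader, InputStreamReader}

import org.apache.spark.launcher.SparkLauncher

object LaunchAndCaptureAppId {
  def main(args: Array[String]): Unit = {
    // Paths and class names are placeholders; assumes SPARK_HOME is set.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MySparkJob")
      .setMaster("yarn-cluster")
      .launch()

    val appIdPattern = "(application_\\d+_\\d+)".r
    val stderr = new BufferedReader(new InputStreamReader(process.getErrorStream))

    // Scan the child's stderr until an application id shows up. A real
    // implementation should keep draining both streams afterwards so the
    // child process does not block on a full pipe buffer.
    var appId: Option[String] = None
    var line = stderr.readLine()
    while (line != null && appId.isEmpty) {
      appId = appIdPattern.findFirstIn(line)
      line = stderr.readLine()
    }
    println(s"captured application id: ${appId.getOrElse("<not found>")}")

    process.waitFor()
  }
}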



Re: Using CUDA within Spark / boosting linear algebra

2015-03-13 Thread Chester At Work
Reynold,

Prof. Canny gave me the slides yesterday. I will post the link to the slides to
both the SF Big Analytics and SF Machine Learning meetups.

Chester

Sent from my iPad

On Mar 12, 2015, at 22:53, Reynold Xin  wrote:

> Thanks for chiming in, John. I missed your meetup last night - do you have
> any writeups or slides about roofline design? In particular, I'm curious
> about what optimizations are available for power-law dense * sparse? (I
> don't have any background in optimizations)
> 
> 
> 
> On Thu, Mar 12, 2015 at 8:50 PM, jfcanny  wrote:
> 
>> If you're contemplating GPU acceleration in Spark, it's important to look
>> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
>> datasets we've tested in BIDMach, and we've tried to make them
>> representative of industry machine learning workloads. Unless you're
>> crunching images or audio, the majority of data will be very sparse and
>> power law distributed. You need a good sparse BLAS, and in practice it seems
>> like you need a sparse BLAS tailored for power-law data. We had to write our
>> own, since the NVIDIA libraries didn't perform well on typical power-law data.
>> Intel MKL's sparse BLAS routines also have issues, and we only use some of them.
>> 
>> You also need 2D reductions, scan operations, slicing, element-wise
>> transcendental functions and operators, many kinds of sort, random number
>> generators etc, and some kind of memory management strategy. Some of this
>> was layered on top of Thrust in BIDMat, but most had to be written from
>> scratch. It's all been rooflined, typically to the memory throughput of current
>> GPUs (around 200 GB/s).
>> 
>> When you have all this you can write Learning Algorithms in the same
>> high-level primitives available in Breeze or Numpy/Scipy. It's literally the
>> same in BIDMat, since the generic matrix operations are implemented on both
>> CPU and GPU, so the same code runs on either platform.
>> 
>> A lesser known fact is that GPUs are around 10x faster for *all* those
>> operations, not just dense BLAS. It's mostly due to faster streaming memory
>> speeds, but some kernels (random number generation and transcendentals) are
>> more than an order of magnitude faster, thanks to some specialized hardware for
>> power series on the GPU chip.
>> 
>> When you have all this there is no need to move data back and forth across
>> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
>> and feed them to the available GPUs. Most models fit comfortably in GPU
>> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
>> data through the GPU this way.
>> 
>> 
>> 
>> 



Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for beginners)

2014-11-19 Thread Chester At Work
gen-idea should work; I use it all the time. But use the approach that works
for you.



Sent from my iPad

On Nov 18, 2014, at 11:12 PM, "Yiming \(John\) Zhang"  wrote:

> Hi Chester, thank you for your reply. But I tried this approach and it
> failed. It seems that using sbt in IntelliJ is more difficult than expected.
> 
> Also, according to some references, "# sbt/sbt gen-idea" is not necessary
> (after Spark-1.0.0?); you can simply import the Spark project and IntelliJ
> will automatically generate the dependencies (but, as described here, with
> some possible mistakes that may fail the compilation).
> 
> Cheers,
> Yiming
> 
> -----Original Message-----
> From: Chester @work [mailto:ches...@alpinenow.com]
> Sent: November 19, 2014 13:00
> To: Chen He
> Cc: sdi...@gmail.com; dev@spark.apache.org
> Subject: Re: Intro to using IntelliJ to debug SPARK-1.1 Apps with mvn/sbt (for
> beginners)
> 
> For sbt, you can simply run
> 
>     sbt/sbt gen-idea
> 
> to generate the IntelliJ IDEA project modules for you. You can then just open
> the generated project, which includes all the needed dependencies.
> 
> Sent from my iPhone
> 
>> On Nov 18, 2014, at 8:26 PM, Chen He  wrote:
>> 
>> Thank you Yiming. It is helpful.
>> 
>> Regards!
>> 
>> Chen
>> 
>> On Tue, Nov 18, 2014 at 8:00 PM, Yiming (John) Zhang 
>> 
>> wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> I noticed it is hard to find a thorough introduction to using
>>> IntelliJ to debug SPARK-1.1 apps with mvn/sbt, which is not
>>> straightforward for beginners. So I spent several days figuring it
>>> out, and I hope it will be helpful for beginners like me and that
>>> professionals can help me improve it. (The intro with figures can be
>>> found at: http://kylinx.com/spark/Debug-Spark-in-IntelliJ.htm)
>>> 
>>> 
>>> 
>>> (1) Install the Scala plugin
>>> 
>>> 
>>> 
>>> (2) Download, unzip and open spark-1.1.0 in IntelliJ
>>> 
>>> a) mvn: File -> Open.
>>> 
>>>   Select the Spark source folder (e.g., /root/spark-1.1.0). It may take a
>>> long time to download and compile a lot of things.
>>> 
>>> b) sbt: File -> Import Project.
>>> 
>>>   Select "Import project from external model", then choose SBT 
>>> project, click Next. Input the Spark source path (e.g., 
>>> /root/spark-1.1.0) for "SBT project", and select Use auto-import.
>>> 
>>> 
>>> 
>>> (3) First compile and run the Spark examples in the console to ensure
>>> everything is OK
>>> 
>>> # mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
>>> 
>>> # ./sbt/sbt assembly -Phadoop-2.2 -Dhadoop.version=2.2.0
>>> 
>>> 
>>> 
>>> (4) Add the compiled spark-hadoop library
>>> (spark-assembly-1.1.0-hadoop2.2.0)
>>> to "Libraries" (File -> Project Structure. -> Libraries -> green +). 
>>> And choose modules that use it (right-click the library and click 
>>> "Add to Modules"). It seems only spark-examples need it.
>>> 
>>> 
>>> 
>>> (5) In the "Dependencies" page of the modules using this library,
>>> ensure that the "Scope" of this library is "Compile" (File -> Project
>>> Structure -> Modules)
>>> 
>>> (6) For sbt, it seems that we have to label the scope of all other
>>> hadoop dependencies (SBT: org.apache.hadoop.hadoop-*) as "Test" (due
>>> to a poor Internet connection?), and this has to be done every time
>>> IntelliJ is opened (due to a bug?)
>>> 
>>> 
>>> 
>>> (7) Configure debug environment (using LogQuery as an example). Run 
>>> -> Edit Configurations.
>>> 
>>> Main class: org.apache.spark.examples.LogQuery
>>> 
>>> VM options: -Dspark.master=local
>>> 
>>> Working directory: /root/spark-1.1.0
>>> 
>>> Use classpath of module: spark-examples_2.10
>>> 
>>> Before launch: External tool: mvn
>>> 
>>>   Program: /root/Programs/apache-maven-3.2.1/bin/mvn
>>> 
>>>   Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests 
>>> package
>>> 
>>>   Working directory: /root/spark-1.1.0
>>> 
>>> Before launch: External tool: sbt
>>> 
>>>   Program: /root/spark-1.1.0/sbt/sbt
>>> 
>>>   Parameters: -Phadoop-2.2 -Dhadoop.version=2.2.0 assembly
>>> 
>>>   Working directory: /root/spark-1.1.0
>>> 
>>> 
>>> 
>>> (8) Click Run -> Debug 'LogQuery' to start debugging
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> Yiming
>>> 
>>> 
> 




Re: Random forest - is it under implementation?

2014-07-11 Thread Chester At Work
Sung Chung from Alpine Data Labs presented the random forest implementation at
Spark Summit 2014. The work will be open-sourced and contributed back to MLlib.

Stay tuned.



Sent from my iPad

On Jul 11, 2014, at 6:02 AM, Egor Pahomov  wrote:

> Hi, I have an intern who wants to implement an ML algorithm for Spark.
> Which algorithm would be a good idea to implement (it should not be very
> difficult)? I heard someone is already working on random forest, but I
> couldn't find proof of that.
> 
> I'm aware of the new policy that we should implement stable, good-quality,
> popular ML algorithms, or not do it at all.
> 
> -- 
> 
> Sincerely yours,
> Egor Pakhomov
> Scala Developer, Yandex