Spark MOOC - early access

2015-05-21 Thread Marco Shaw
Hi Spark Devs and Users,

BerkeleyX and Databricks are currently developing two Spark-related MOOCs on
edX (intro, ml), the first of which starts on June 1st.  Together these
courses have over 75K enrolled students!

To help students work through the course exercises, we have created a Vagrant
box that contains Spark and IPython (running on 32-bit Ubuntu).  This
simplifies user setup and helps us support them.

We are writing to give you early access to the VM environment and the first
assignment, and to ask for your help testing the VM and assignment before we
unleash them on 75K people (see instructions below).  We're happy to help if
you have any difficulties getting the VM set up; please feel free to contact
me (marco.s...@gmail.com) with any issues, comments, or questions.

Sincerely,
Marco Shaw
Spark MOOC TA

(This was sent as an HTML-formatted email; some of the links have been
duplicated just in case.)

1. Install VirtualBox on your OS (see the Windows tutorial:
   https://www.youtube.com/watch?v=06Sf-m64fcY).

2. Install Vagrant on your OS (see the Windows tutorial:
   https://www.youtube.com/watch?v=LZVS23BaA1I).

3. Install the virtual machine using the following steps (see the Windows
   tutorial: https://www.youtube.com/watch?v=ZuJCqHC7IYc):
   a. Create a custom directory (e.g. c:\users\marco\myvagrant or
      /home/marco/myvagrant).
   b. Download the course Vagrantfile to the custom directory (NOTE: it must
      be named exactly "Vagrantfile", with no extension).
   c. Open a DOS prompt (Windows) or terminal (Mac/Linux) in the custom
      directory and issue the command "vagrant up".

4. Perform basic commands in the VM as described below (see the Windows
   tutorial: https://www.youtube.com/watch?v=bkteLH77IR0):
   a. To start the VM, from a DOS prompt (Windows) or terminal (Mac/Linux),
      issue the command "vagrant up".
   b. To stop the VM, issue the command "vagrant halt".
   c. To erase or delete the VM, issue the command "vagrant destroy".
   d. Once the VM is running, to access the notebook, open a web browser to
      "http://localhost:8001".

5. Use the test notebook as described below (see the Windows tutorial:
   https://www.youtube.com/watch?v=mlfAmyF3Q-s):
   a. To start the VM, issue the command "vagrant up".
   b. Once the VM is running, open a web browser to "http://localhost:8001".
   c. Upload this IPython notebook:
      https://raw.githubusercontent.com/spark-mooc/mooc-setup/master/vm_test_student.ipynb
   d. Run through the notebook.  (A quick sanity-check cell is sketched at
      the end of this message.)

6. Play around with the first MOOC assignment (email Marco for details when
   you get to this point).

7. Please answer the following questions:
   a. What machine are you using (OS, RAM, CPU, age)?
   b. How long did the entire process take?
   c. How long did the VM download take?  Relatedly, where are you located?
   d. Do you have any other comments/suggestions?
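
For step 5d, here is a minimal sanity check you could paste into a notebook
cell once http://localhost:8001 is up.  It is not part of the official
assignment, and it assumes the course notebook already exposes a ready-made
SparkContext named "sc", as typical IPython-plus-Spark setups do:

# Hypothetical smoke test for the course VM -- not part of the assignment.
# Assumes the notebook provides a SparkContext named `sc`.
data = sc.parallelize(range(1, 101))
print(data.count())                              # expect 100
print(data.filter(lambda x: x % 2 == 0).sum())   # expect 2550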


Need some guidance

2015-04-13 Thread Marco Shaw
**Learning the ropes**

I'm trying to grasp the concept of pipelining operations in PySpark...

Simplified example:
>>> list=[(1,"alpha"),(1,"beta"),(1,"foo"),(1,"alpha"),(2,"alpha"),(2,"alpha"),(2,"bar"),(3,"foo")]

Desired outcome:
[(1,3),(2,2),(3,1)]

Basically for each key, I want the number of unique values.

I've tried different approaches, but am I really using Spark effectively?
I wondered if I would do something like:
>>> input=sc.parallelize(list)
>>> input.groupByKey().collect()

Then I wondered if I could do something like a foreach over each key value,
and then map the actual values and reduce them.  Pseudo-code:

input.groupByKey()
     .keys
     .foreach(_.values
              .map(lambda x: (x, 1))
              .reduceByKey(lambda a, b: a + b)
              .count()
     )

I was somehow hoping that each key would end up paired with the value returned
by count(), i.e. the number of unique values for that key, which is exactly
what I think I'm looking for.

Am I way off base on how I could accomplish this?
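
One concrete approach I've sketched out (untested; just a minimal sketch
against the plain RDD API) is to drop duplicate (key, value) pairs first and
then count what's left per key:

from pyspark import SparkContext

sc = SparkContext(appName="unique-values-per-key")   # skip if a context already exists

pairs = [(1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
         (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")]
rdd = sc.parallelize(pairs)

# distinct() removes duplicate (key, value) pairs, so counting 1 per
# remaining pair gives the number of unique values for each key.
unique_counts = (rdd.distinct()
                    .map(lambda kv: (kv[0], 1))
                    .reduceByKey(lambda a, b: a + b))

print(sorted(unique_counts.collect()))   # [(1, 3), (2, 2), (3, 1)]

Would that be considered idiomatic, or is there a better way?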

Marco


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Sudipta - Please don't ever come here or post here again.

On Thu, Jan 22, 2015 at 1:25 PM, Sudipta Banerjee <
asudipta.baner...@gmail.com> wrote:

> Hi Nicos, Taking forward your argument,please be a smart a$$ and dont use
> unprofessional language just for the sake of being a moderator.
> Paco Nathan is respected for the dignity he carries in sharing his
> knowledge and making it available free for a$$es like us right!
> So just mind your tongue next time you put such a$$ in your mouth.
>
> Best Regards,
> Sudipta
>
> On Thu, Jan 22, 2015 at 10:39 PM, Nicos Kekchidis  wrote:
>
>> Folks,
>> Just a gentle reminder we owe to ourselves:
>> - this is a public forum and we need to behave accordingly; it is not a
>> place to vent frustration in a rude way
>> - getting attention here is an earned privilege, not an entitlement
>> - this is not a “Platinum Support” department of your vendor but rather an
>> open source collaboration forum where people volunteer their time to pay
>> attention to your needs
>> - there are still many gray areas, so be patient and articulate questions
>> in as much detail as possible if you want to get quick help and not just
>> be perceived as a smart a$$
>>
>> FYI - Paco Nathan is a well-respected Spark evangelist, and many people,
>> including myself, owe our leap onto the Spark platform's promise to his
>> passion.  People like Sean Owen keep us believing in things when we feel
>> like hitting a dead end.
>>
>> Please, be respectful of what connections you are prized with and act
>> civilized.
>>
>> Have a great day!
>> - Nicos
>>
>>
>> > On Jan 22, 2015, at 7:49 AM, Sean Owen  wrote:
>> >
>> > Yes, this isn't a well-formed question, and got maybe the response it
>> > deserved, but the tone is veering off the rails. I just got a much
>> > ruder reply from Sudipta privately, which I will not forward. Sudipta,
>> > I suggest you take the responses you've gotten so far as about as much
>> > answer as can be had here and do some work yourself, and come back
>> > with much more specific questions, and it will all be helpful and
>> > polite again.
>> >
>> > On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
>> >  wrote:
>> >> Hi Marco,
>> >>
>> >> Thanks for the confirmation. Please let me know what are the lot more
>> detail
>> >> you need to answer a very specific question  WHAT IS THE MINIMUM
>> HARDWARE
>> >> CONFIGURATION REQUIRED TO BUILT HDFS+ MAPREDUCE+SPARK+YARN  on a
>> system?
>> >> Please let me know if you need any further information and if you dont
>> know
>> >> please drive across with the $1 to Sir Paco Nathan and get me the
>> >> answer.
>> >>
>> >> Thanks and Regards,
>> >> Sudipta
>> >>
>> >> On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw 
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> Let me reword your request so you understand how (too) generic your
>> >>> question is
>> >>>
>> >>> "Hi, I have $10,000, please find me some means of transportation so I
>> can
>> >>> get to work."
>> >>>
>> >>> Please provide (a lot) more details. If you can't, consider using one
>> of
>> >>> the pre-built express VMs from either Cloudera, Hortonworks or MapR,
>> for
>> >>> example.
>> >>>
>> >>> Marco
>> >>>
>> >>>
>> >>>
>> >>>> On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
>> >>>>  wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> Hi Apache-Spark team ,
>> >>>>
>> >>>> What are the system requirements for installing Hadoop and Apache Spark?
>> >>>> I have attached the screen shot of Gparted.
>> >>>>
>> >>>>
>> >>>> Thanks and regards,
>> >>>> Sudipta
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Sudipta Banerjee
>> >>>> Consultant, Business Analytics and Cloud Based Architecture
>> >>>> Call me +919019578099
>> >>>> 
>> >>>>
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Sudipta Banerjee
>> >> Consultant, Business Analytics and Cloud Based Architecture
>> >> Call me +919019578099
>> >
>> >
>> >
>>
>>
>
>
> --
> Sudipta Banerjee
> Consultant, Business Analytics and Cloud Based Architecture
> Call me +919019578099
>


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
(Starting over...)

The best place to look for the requirements would be at the individual
pages of each technology.

As for absolute minimum requirements, I would suggest 50GB of disk space
and at least 8GB of memory.  This is the absolute minimum.

"Architecting" a solution like you are looking for is very complex.  If you
are just looking for a proof-of-concept consider a Docker image or going to
Cloudera/Hortonworks/MapR and look for their "express VMs" which can
usually run on Oracle Virtualbox or VMware.

Marco


On Thu, Jan 22, 2015 at 7:36 AM, Sudipta Banerjee <
asudipta.baner...@gmail.com> wrote:

>
>
> Hi Apache-Spark team ,
>
> What are the system requirements for installing Hadoop and Apache Spark?
> I have attached the screen shot of Gparted.
>
>
> Thanks and regards,
> Sudipta
>
>
>
>
> --
> Sudipta Banerjee
> Consultant, Business Analytics and Cloud Based Architecture
> Call me +919019578099
>
>
>


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Hi,

Let me reword your request so you understand how (too) generic your question is:

"Hi, I have $10,000, please find me some means of transportation so I can get 
to work."

Please provide (a lot) more details. If you can't, consider using one of the 
pre-built express VMs from either Cloudera, Hortonworks or MapR, for example. 

Marco



> On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee  
> wrote:
> 
> 
> 
> Hi Apache-Spark team ,
> 
> What are the system requirements for installing Hadoop and Apache Spark?
> I have attached the screen shot of Gparted.
> 
> 
> Thanks and regards,
> Sudipta 
> 
> 
> 
> 
> -- 
> Sudipta Banerjee
> Consultant, Business Analytics and Cloud Based Architecture 
> Call me +919019578099
> 
> 



Re: DeepLearning and Spark ?

2015-01-09 Thread Marco Shaw
Pretty vague on details:

http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199


> On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa  wrote:
> 
> Hi all,
> 
> Deep learning algorithms are popular and achieve state-of-the-art 
> performance on several real-world machine learning problems. Currently there 
> is no DL implementation in Spark, and I wonder whether there is ongoing work on 
> this topic.
> 
> We can do DL in Spark with Sparkling Water and H2O, but this adds an additional 
> software stack.
> 
> Deeplearning4j seems to implement distributed versions of many popular DL 
> algorithms. Porting DL4J to Spark could be interesting.
> 
> Google describes an implementation of large-scale DL in this paper: 
> http://research.google.com/archive/large_deep_networks_nips2012.html. It is based 
> on model parallelism and data parallelism.
> 
> So, I'm trying to imagine what a good design for a DL algorithm in 
> Spark should look like. Spark already has RDDs (for data parallelism). Can GraphX be used for 
> the model parallelism (as DNNs are generally designed as DAGs)? And what about 
> using GPUs for local parallelism (a mechanism to push partitions into GPU 
> memory)? 
> 
> 
> What do you think about this ?
> 
> 
> Cheers,
> 
> Jao
> 


Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Marco Shaw
When it is ready. 



> On Dec 16, 2014, at 11:43 PM, 张建轶  wrote:
> 
> Hi!
> 
> When will Spark 1.3.0 be released?
> I want to use the new LDA feature.
> Thank you!




Re: Starting with spark

2014-07-24 Thread Marco Shaw
First thing...  Go into the Cloudera Manager and make sure that the Spark
service (master?) is started.
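
Once the service is running, a quick way to confirm a client can actually
reach the master is a tiny job like the sketch below.  This is just a
minimal, untested sketch: the master URL is a placeholder -- use the
spark://... address shown in your Spark Master web UI.

from pyspark import SparkConf, SparkContext

# Placeholder master URL -- replace with the address from the Master web UI.
conf = SparkConf().setAppName("smoke-test").setMaster("spark://localhost:7077")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).count())   # should print 1000 if the cluster is reachable
sc.stop()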

Marco


On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed 
wrote:

> Hello All,
>
> I am a new user of Spark, and I am using *cloudera-quickstart-vm-5.0.0-0-vmware*
> to run the sample Spark examples.
> I am very sorry for the silly and basic question.
> I am not able to deploy and execute the sample examples of Spark.
>
> Please suggest *how to start with Spark*.
>
> Please help me
> Thanks in advance.
>
> Regards,
> Sam
>


Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
Looks like going with cluster mode is not a good idea:
http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/

Seems like a non-HDInsight VM might be needed to make it the Spark master
node.

Marco



On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw  wrote:

> I'm a Spark and HDInsight novice, so I could be wrong...
>
> HDInsight is based on HDP2, so my guess here is that you have the option
> of installing/configuring Spark in cluster mode (YARN) or in standalone
> mode and package the Spark binaries with your job.
>
> Everything I seem to look at is related to UNIX shell scripts.  So, one
> might need to pull apart some of these scripts to pick out how to run this
> on Windows.
>
> Interesting project...
>
> Marco
>
>
>
> On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax  wrote:
>
>> Hi everyone,
>>
>> Currently I am working on parallelizing a machine learning algorithm
>> using a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
>> MapReduce, but since my algorithm is iterative the job scheduling overhead
>> and data loading overhead severely limits the performance of my algorithm
>> in terms of training time.
>>
>> Since recently, HDInsight supports Hadoop 2 with YARN, which I thought
>> would allow me to run Spark jobs, which seem more fitting for my task. So
>> far I have not been able however to find how I can run Apache Spark jobs on
>> a HDInsight cluster.
>>
>> It seems like remote job submission (which would have my preference) is
>> not possible for Spark on HDInsight, as REST endpoints for Oozie and
>> templeton do not seem to support submission of Spark jobs. I also tried to
>> RDP to the headnode for job submission from the headnode. On the headnode
>> drives I can find other new YARN computation models like Tez and I also
>> managed to run Tez jobs on it through YARN. However, Spark seems to be
>> missing. Does this mean that HDInsight currently does not support Spark,
>> even though it supports Hadoop versions with YARN? Or do I need to install
>> Spark on the HDInsight cluster first, in some way? Or is there maybe
>> something else that I'm missing and can I run Spark jobs on HDInsight some
>> other way?
>>
>> Many thanks in advance!
>>
>>
>> Kind regards,
>>
>> Niek Tax
>>
>
>


Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
I'm a Spark and HDInsight novice, so I could be wrong...

HDInsight is based on HDP2, so my guess here is that you have the option of
installing/configuring Spark in cluster mode (YARN) or in standalone mode
and package the Spark binaries with your job.

Everything I seem to look at is related to UNIX shell scripts.  So, one
might need to pull apart some of these scripts to pick out how to run this
on Windows.

Interesting project...

Marco



On Mon, Jul 14, 2014 at 8:00 AM, Niek Tax  wrote:

> Hi everyone,
>
> Currently I am working on parallelizing a machine learning algorithm using
> a Microsoft HDInsight cluster. I tried running my algorithm on Hadoop
> MapReduce, but since my algorithm is iterative the job scheduling overhead
> and data loading overhead severely limits the performance of my algorithm
> in terms of training time.
>
> Since recently, HDInsight supports Hadoop 2 with YARN, which I thought
> would allow me to run Spark jobs, which seem more fitting for my task. So
> far I have not been able however to find how I can run Apache Spark jobs on
> a HDInsight cluster.
>
> It seems like remote job submission (which would have my preference) is
> not possible for Spark on HDInsight, as REST endpoints for Oozie and
> templeton do not seem to support submission of Spark jobs. I also tried to
> RDP to the headnode for job submission from the headnode. On the headnode
> drives I can find other new YARN computation models like Tez and I also
> managed to run Tez jobs on it through YARN. However, Spark seems to be
> missing. Does this mean that HDInsight currently does not support Spark,
> even though it supports Hadoop versions with YARN? Or do I need to install
> Spark on the HDInsight cluster first, in some way? Or is there maybe
> something else that I'm missing and can I run Spark jobs on HDInsight some
> other way?
>
> Many thanks in advance!
>
>
> Kind regards,
>
> Niek Tax
>


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
That is confusing based on the context you provided. 

This might take more time than I can spare to try to understand. 

For sure, you need to add Spark to run it in/on the HDP 2.1 express VM. 

Cloudera's CDH 5 express VM includes Spark, but the service isn't running by 
default. 

I can't remember for MapR...

Marco

> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev 
>  wrote:
> 
> Marco,
> 
> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can 
> try
> from
> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf  
> HDP 2.1 means YARN, yet at the same time they propose to install an RPM.
> 
> On the other hand, http://spark.apache.org/ says "
> Integrated with Hadoop
> Spark can run on Hadoop 2's YARN cluster manager, and can read any existing 
> Hadoop data.
> 
> If you have a Hadoop 2 cluster, you can run Spark without any installation 
> needed. "
> 
> And this is confusing for me... do I need an RPM installation or not?...
> 
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> 
>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw  wrote:
>> Can you provide links to the sections that are confusing?
>> 
>> My understanding, the HDP1 binaries do not need YARN, while the HDP2 
>> binaries do. 
>> 
>> Now, you can also install Hortonworks Spark RPM...
>> 
>> For production, in my opinion, RPMs are better for manageability. 
>> 
>>> On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev 
>>>  wrote:
>>> 
>>> Hello, thanks for your message... I'm confused: Hortonworks suggests installing 
>>> the Spark RPM on each node, but the Spark main page says that YARN is enough and I 
>>> don't need to install it... What's the difference?
>>> 
>>> sent from my HTC
>>> 
>>>> On Jul 6, 2014 8:34 PM, "vs"  wrote:
>>>> Konstantin,
>>>> 
>>>> HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
>>>> from
>>>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>>>> 
>>>> Let me know if you see issues with the tech preview.
>>>> 
>>>> "spark PI example on HDP 2.0
>>>> 
>>>> I downloaded spark 1.0 pre-build from 
>>>> http://spark.apache.org/downloads.html
>>>> (for HDP2)
>>>> The run example from spark web-site:
>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>>> yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
>>>> --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>>>> 
>>>> I got error:
>>>> Application application_1404470405736_0044 failed 3 times due to AM
>>>> Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
>>>> due to: Exception from container-launch:
>>>> org.apache.hadoop.util.Shell$ExitCodeException:
>>>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>>>> at org.apache.hadoop.util.Shell.run(Shell.java:379)
>>>> at 
>>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>>>> at
>>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>>>> at
>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>>>> at
>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>> .Failing this attempt.. Failing the application.
>>>> 
>>>> Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 
>>>> 1,
>>>> --num-executors, 3)
>>>> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
>>>> Options:
>>>>   --jar JAR_PATH   Path to your application's JAR file (required)
>>>>   --class CLASS_NAME   Name of your application's main class (required)
>>>> ...bla-bla-bla
>>>> "
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context: 
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
Can you provide links to the sections that are confusing?

My understanding is that the HDP1 binaries do not need YARN, while the HDP2
binaries do. 

Now, you can also install Hortonworks Spark RPM...

For production, in my opinion, RPMs are better for manageability. 

> On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev 
>  wrote:
> 
> Hello, thanks for your message... I'm confused: Hortonworks suggests installing 
> the Spark RPM on each node, but the Spark main page says that YARN is enough and I 
> don't need to install it... What's the difference?
> 
> sent from my HTC
> 
>> On Jul 6, 2014 8:34 PM, "vs"  wrote:
>> Konstantin,
>> 
>> HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can try
>> from
>> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
>> 
>> Let me know if you see issues with the tech preview.
>> 
>> "spark PI example on HDP 2.0
>> 
>> I downloaded spark 1.0 pre-build from http://spark.apache.org/downloads.html
>> (for HDP2)
>> The run example from spark web-site:
>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>> yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g
>> --executor-cores 1 ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 2
>> 
>> I got error:
>> Application application_1404470405736_0044 failed 3 times due to AM
>> Container for appattempt_1404470405736_0044_03 exited with exitCode: 1
>> due to: Exception from container-launch:
>> org.apache.hadoop.util.Shell$ExitCodeException:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
>> at org.apache.hadoop.util.Shell.run(Shell.java:379)
>> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>> at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> .Failing this attempt.. Failing the application.
>> 
>> Unknown/unsupported param List(--executor-memory, 2048, --executor-cores, 1,
>> --num-executors, 3)
>> Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
>> Options:
>>   --jar JAR_PATH   Path to your application's JAR file (required)
>>   --class CLASS_NAME   Name of your application's main class (required)
>> ...bla-bla-bla
>> "
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-run-Spark-1-0-SparkPi-on-HDP-2-0-tp8802p8873.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded...  For example, 2013: http://spark-summit.org/2013

I'm assuming the 2014 videos will be up in 1-2 weeks.

Marco


On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta 
wrote:

> Are these sessions recorded ?
>
>
> On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos  wrote:
>
>>
>> General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
>> Track A: http://www.ustream.tv/channel/track-a1
>> Track B: http://www.ustream.tv/channel/track-b1
>> Track C: http://www.ustream.tv/channel/track-c1
>>
>>
>> On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha 
>> wrote:
>>
>>> I attended yesterday on ustream.tv, but can't find the links to today's
>>> streams anywhere. help!
>>>
>>> --
>>> Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
>>>
>>
>>
>


Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Sorry. Never mind...  I guess that's what "Summingbird" is all about. Never 
heard of it. 

> On Jun 27, 2014, at 7:10 PM, Marco Shaw  wrote:
> 
> Dean: Some interesting information... Do you know where I can read more about 
> these coming changes to Scalding/Cascading?
> 
>> On Jun 27, 2014, at 9:40 AM, Dean Wampler  wrote:
>> 
>> ... and to be clear on the point, Summingbird is not limited to MapReduce. 
>> It abstracts over Scalding (which abstracts over Cascading, which is being 
>> moved from MR to Spark) and over Storm for event processing.
>> 
>> 
>>> On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen  wrote:
>>> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia  
>>> wrote:
>>> > Summingbird is for map/reduce. Dataflow is the third generation of 
>>> > google's
>>> > map/reduce, and it generalizes map/reduce the way Spark does. See more 
>>> > about
>>> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>>> 
>>> Yes, my point was that Summingbird is similar in that it is a
>>> higher-level service for batch/streaming computation, not that it is
>>> similar for being MapReduce-based.
>>> 
>>> > It seems Dataflow is based on this paper:
>>> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>>> 
>>> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
>>> more than that but yeah that seems to be some of the 'language'. It is
>>> similar in that it is a distributed collection abstraction.
>> 
>> 
>> 
>> -- 
>> Dean Wampler, Ph.D.
>> Typesafe
>> @deanwampler
>> http://typesafe.com
>> http://polyglotprogramming.com


Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Dean: Some interesting information... Do you know where I can read more about 
these coming changes to Scalding/Cascading?

> On Jun 27, 2014, at 9:40 AM, Dean Wampler  wrote:
> 
> ... and to be clear on the point, Summingbird is not limited to MapReduce. It 
> abstracts over Scalding (which abstracts over Cascading, which is being moved 
> from MR to Spark) and over Storm for event processing.
> 
> 
>> On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen  wrote:
>> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia  
>> wrote:
>> > Summingbird is for map/reduce. Dataflow is the third generation of google's
>> > map/reduce, and it generalizes map/reduce the way Spark does. See more 
>> > about
>> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>> 
>> Yes, my point was that Summingbird is similar in that it is a
>> higher-level service for batch/streaming computation, not that it is
>> similar for being MapReduce-based.
>> 
>> > It seems Dataflow is based on this paper:
>> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>> 
>> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflows is
>> more than that but yeah that seems to be some of the 'language'. It is
>> similar in that it is a distributed collection abstraction.
> 
> 
> 
> -- 
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com


Re: How to Run Machine Learning Examples

2014-05-22 Thread Marco Shaw
About run-example: I've tried the MapR, Hortonworks and Cloudera distributions
with their Spark packages, and none seem to package it. 

Am I missing something?  Is this only provided with the Spark project pre-built 
binaries or from source installs?

Marco

> On May 22, 2014, at 5:04 PM, Stephen Boesch  wrote:
> 
> 
> There is a bin/run-example.sh  []
> 
> 
> 2014-05-22 12:48 GMT-07:00 yxzhao :
>> I want to run the LR, SVM, and NaiveBayes algorithms implemented in the
>> following directory on my data set. But I did not find the sample command
>> line to run them. Anybody help? Thanks.
>> spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Run-Machine-Learning-Examples-tp6277.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
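
For the original question above (sample invocations of the MLlib classifiers),
here is a minimal, untested PySpark sketch on a tiny hand-made dataset.  It
assumes a Spark 1.0+ build with the Python MLlib bindings available; for real
data you would load and parse your own file instead:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                          SVMWithSGD, NaiveBayes)

sc = SparkContext(appName="mllib-classification-demo")

# Toy dataset: a label followed by a two-element feature vector.
points = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.5, 1.5]),
    LabeledPoint(1.0, [3.0, 0.0]),
    LabeledPoint(1.0, [2.5, 0.5]),
])

lr_model = LogisticRegressionWithSGD.train(points, iterations=50)
svm_model = SVMWithSGD.train(points, iterations=50)
nb_model = NaiveBayes.train(points)

# Rough sanity check -- exact outputs depend on the solver and the data.
print(lr_model.predict([2.8, 0.2]))
print(svm_model.predict([0.2, 1.2]))
print(nb_model.predict([2.8, 0.2]))
sc.stop()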


Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi,

I've wanted to play with Spark.  I wanted to fast-track things and just use
one of the vendors' "express VMs".  I've tried Cloudera CDH 5.0 and
Hortonworks HDP 2.1.

I've not written down all of my issues, but for certain, when I try to run
spark-shell it doesn't work.  Cloudera seems to crash, and both complain
when I try to use "SparkContext" in a simple Scala command.

So, just a basic question on whether anyone has had success getting these
express VMs to work properly with Spark *out of the box* (HDP does require
you to install Spark manually).

I know Cloudera recommends 8GB of RAM, but I've been running it with 4GB.

Could it be that 4GB is just not enough and is causing the issues, or have
others had success using these Hadoop 2.x pre-built VMs with Spark 0.9.x?

Marco