Spark MOOC - early access

2015-05-21 Thread Marco Shaw
Hi Spark Devs and Users, BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX (intro https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x, ml https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x), the first of which

Need some guidance

2015-04-13 Thread Marco Shaw
**Learning the ropes** I'm trying to grasp the concept of using the pipeline in pySpark... Simplified example: list=[(1,alpha),(1,beta),(1,foo),(1,alpha),(2,alpha),(2,alpha),(2,bar),(3,foo)] Desired outcome: [(1,3),(2,2),(3,1)] Basically for each key, I want the number of unique values. I've
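A minimal sketch of one way to get the number of unique values per key, assuming the values in the example are plain strings. The function names and the variable `pairs` are hypothetical, chosen for illustration. The first function is a pure-Python reference; the second applies the same two steps (drop duplicate pairs, then count per key) to a PySpark RDD using `distinct`, `map`, `reduceByKey` and `sortByKey`.

```python
from collections import Counter

# The example data from the question, with values assumed to be strings.
pairs = [(1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
         (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")]


def unique_value_counts_local(pairs):
    """Pure-Python reference: count distinct values per key."""
    distinct_pairs = set(pairs)                       # drop duplicate (key, value) pairs
    counts = Counter(key for key, _ in distinct_pairs)  # one count per remaining pair
    return sorted(counts.items())


def unique_value_counts_rdd(rdd):
    """Same two steps on a PySpark RDD of (key, value) pairs."""
    return (rdd.distinct()                        # drop duplicate (key, value) pairs
               .map(lambda kv: (kv[0], 1))        # one count per remaining pair
               .reduceByKey(lambda a, b: a + b)   # sum the counts for each key
               .sortByKey())


print(unique_value_counts_local(pairs))
```

With a running SparkContext `sc` (as in the pyspark shell), the RDD version would be called as `unique_value_counts_rdd(sc.parallelize(pairs)).collect()`.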

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
Hi, Let me reword your request so you understand how (too) generic your question is: "Hi, I have $10,000, please find me some means of transportation so I can get to work." Please provide (a lot) more details. If you can't, consider using one of the pre-built express VMs from either

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
know if you need any further information and if you don't know please drive across with the $1 to Sir Paco Nathan and get me the answer. Thanks and Regards, Sudipta On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw marco.s...@gmail.com wrote: Hi, Let me reword your request so

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Marco Shaw
(Starting over...) The best place to look for the requirements would be at the individual pages of each technology. As for absolute minimum requirements, I would suggest 50GB of disk space and at least 8GB of memory. This is the absolute minimum. Architecting a solution like you are looking

Re: DeepLearning and Spark ?

2015-01-09 Thread Marco Shaw
Pretty vague on details: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A227199 On Jan 9, 2015, at 11:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, DeepLearning algorithms are popular and achieve many state of the art performance in several real world

Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Marco Shaw
When it is ready. On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote: Hi! when will the spark 1.3.0 be released? I want to use new LDA feature. Thank you!

Re: Starting with spark

2014-07-24 Thread Marco Shaw
First thing... Go into the Cloudera Manager and make sure that the Spark service (master?) is started. Marco On Thu, Jul 24, 2014 at 7:53 AM, Sameer Sayyed sam.sayyed...@gmail.com wrote: Hello All, I am new user of spark, I am using *cloudera-quickstart-vm-5.0.0-0-vmware* for execute

Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
I'm a Spark and HDInsight novice, so I could be wrong... HDInsight is based on HDP2, so my guess here is that you have the option of installing/configuring Spark in cluster mode (YARN) or in standalone mode and package the Spark binaries with your job. Everything I seem to look at is related to

Re: Running Spark on Microsoft Azure HDInsight

2014-07-14 Thread Marco Shaw
Looks like going with cluster mode is not a good idea: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/ Seems like a non-HDInsight VM might be needed to make it the Spark master node. Marco On Mon, Jul 14, 2014 at 12:43 PM, Marco Shaw marco.s

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
Can you provide links to the sections that are confusing? My understanding, the HDP1 binaries do not need YARN, while the HDP2 binaries do. Now, you can also install Hortonworks Spark RPM... For production, in my opinion, RPMs are better for manageability. On Jul 6, 2014, at 5:39 PM,

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Marco Shaw
installation needed. And this is confusing for me... do I need rpm installation or not?... Thank you, Konstantin Kudryavtsev On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw marco.s...@gmail.com wrote: Can you provide links to the sections that are confusing? My understanding, the HDP1

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded... For example, 2013: http://spark-summit.org/2013 I'm assuming the 2014 videos will be up in 1-2 weeks. Marco On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading? On Jun 27, 2014, at 9:40 AM, Dean Wampler deanwamp...@gmail.com wrote: ... and to be clear on the point, Summingbird is not limited to MapReduce. It abstracts over

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Marco Shaw
Sorry. Never mind... I guess that's what Summingbird is all about. Never heard of it. On Jun 27, 2014, at 7:10 PM, Marco Shaw marco.s...@gmail.com wrote: Dean: Some interesting information... Do you know where I can read more about these coming changes to Scalding/Cascading? On Jun 27

Re: How to Run Machine Learning Examples

2014-05-22 Thread Marco Shaw
About run-example, I've tried MapR, Hortonworks and Cloudera distributions with their Spark packages and none seem to package it. Am I missing something? Is this only provided with the Spark project pre-built binaries or from source installs? Marco On May 22, 2014, at 5:04 PM, Stephen

Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi, I've wanted to play with Spark. I wanted to fast track things and just use one of the vendor's express VMs. I've tried Cloudera CDH 5.0 and Hortonworks HDP 2.1. I've not written down all of my issues, but for certain, when I try to run spark-shell it doesn't work. Cloudera seems to crash,