Shuffle Files

2014-03-03 Thread Usman Ghani
Where on the filesystem does spark write the shuffle files?


Re: o.a.s.u.Vector instances for equality

2014-03-03 Thread Shixiong Zhu
Vector is an enhanced Array[Double]. You can compare it like Array[Double].
E.g.,

scala> val v1 = Vector(1.0, 2.0)
v1: org.apache.spark.util.Vector = (1.0, 2.0)

scala> val v2 = Vector(1.0, 2.0)
v2: org.apache.spark.util.Vector = (1.0, 2.0)

scala> val exactResult = v1.elements.sameElements(v2.elements) // exact comparison
exactResult: Boolean = true

scala> val delta = 1E-6
delta: Double = 1.0E-6

scala> val inexactResult = v1.elements.length == v2.elements.length &&
  v1.elements.zip(v2.elements).forall { case (x, y) => (x - y).abs < delta } // inexact comparison
inexactResult: Boolean = true
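
You can also wrap both checks in a small helper method (just a sketch; the default tolerance is arbitrary):

import org.apache.spark.util.Vector

// True if both vectors have the same length and every pair of elements
// differs by less than the given tolerance.
def approxEquals(a: Vector, b: Vector, tolerance: Double = 1e-6): Boolean =
  a.elements.length == b.elements.length &&
    a.elements.zip(b.elements).forall { case (x, y) => (x - y).abs < tolerance }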

Best Regards,
Shixiong Zhu


2014-03-04 4:23 GMT+08:00 Oleksandr Olgashko :

> Hello. How should i better check two Vector's for equality?
>
> val a = new Vector(Array(1))
> val b = new Vector(Array(1))
> println(a == b)
> // false
>


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
To be more precise, the difference depends on the deserialization overhead
of Kryo for your data structures.
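
To illustrate, the storage level controls whether an RDD is cached deserialized or serialized (a minimal sketch; it assumes a SparkContext named sc, Kryo already configured, and an illustrative HDFS path):

import org.apache.spark.storage.StorageLevel

// Cached as deserialized Java objects: fastest to access, largest footprint.
val fast = sc.textFile("hdfs://namenode:9000/path/to/data")
  .persist(StorageLevel.MEMORY_ONLY)

// Cached serialized (e.g. with Kryo): smaller footprint, but every access
// pays the deserialization cost mentioned above.
val compact = sc.textFile("hdfs://namenode:9000/path/to/data")
  .persist(StorageLevel.MEMORY_ONLY_SER)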


On Mon, Mar 3, 2014 at 8:21 PM, Koert Kuipers  wrote:

> yes, tachyon is in memory serialized, which is not as fast as cached in
> memory in spark (not serialized). the difference really depends on your job
> type.
>
>
>
> On Mon, Mar 3, 2014 at 7:10 PM, polkosity  wrote:
>
>> Thats exciting!  Will be looking into that, thanks Andrew.
>>
>> Related topic, has anyone had any experience running Spark on Tachyon
>> in-memory filesystem, and could offer their views on using it?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
Yes, Tachyon is in-memory serialized, which is not as fast as data cached in
memory in Spark (not serialized). The difference really depends on your job
type.



On Mon, Mar 3, 2014 at 7:10 PM, polkosity  wrote:

> Thats exciting!  Will be looking into that, thanks Andrew.
>
> Related topic, has anyone had any experience running Spark on Tachyon
> in-memory filesystem, and could offer their views on using it?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: pyspark crash on mesos

2014-03-03 Thread Josh Rosen
Brad and I looked into this error and I have a few hunches about what might
be happening.

We didn't observe any failed tasks in the logs.  For some reason, the
Python driver is failing to acknowledge an accumulator update from a
successfully-completed task.  Our program doesn't explicitly use
accumulators, but it looks like PySpark task results always contain a
single Java accumulator with a PythonAccumulatorParam.

DAGScheduler doesn't catch the exception thrown by the failed addInPlace
accumulator operation, which results in the entire DAGScheduler crashing
and the job freezing.

We should try to identify the root cause of the unacknowledged accumulator
updates, but in the meantime it would be a good idea to add more exception
handling to ensure that the DAGScheduler's main loop never crashes.  This
might mask the presence of bugs like this accumulator issue, but it would
prevent rare bugs from taking out the entire SparkContext (this is
especially important for job servers that share a SparkContext across
multiple jobs).  A general fix might be to add a "catch Exception" block in
handleTaskCompletion that turns uncaught exceptions into task failures, and
possibly a top-level "catch Exception" block in DAGScheduler.run() that
causes any uncaught exceptions to immediately cancel/crash all running jobs
(so we fail-fast instead of hanging).
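
To make the shape of that guard concrete, here is a generic sketch of the pattern (illustrative only, with made-up names; not the actual DAGScheduler code):

// Wrap each event handler so that one bad event cannot kill the whole loop.
def runLoop[E](events: Iterator[E])
              (handle: E => Unit)
              (onFailure: (E, Throwable) => Unit): Unit = {
  for (event <- events) {
    try {
      handle(event)           // normal handling, e.g. task completion
    } catch {
      case t: Throwable =>
        onFailure(event, t)   // fail the event (or all jobs), not the loop
    }
  }
}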

It would be nice if there was a way to selectively enable Java-style
checked exceptions to avoid introducing these types of failure-handling
bugs.


On Mon, Mar 3, 2014 at 10:34 AM, Brad Miller wrote:

> Hi All,
>
> After switching from standalone Spark to Mesos I'm experiencing some
> instability.  I'm running pyspark interactively through iPython
> notebook, and get this crash non-deterministically (although pretty
> reliably in the first 2000 tasks, often much sooner).
>
> Exception in thread "DAGScheduler" org.apache.spark.SparkException:
> EOF reached before Python server acknowledged
> at
> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:340)
> at
> org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
> at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
> at
> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
> at
> org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:662)
> at
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:437)
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
> at
> org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)
>
> I'm running the following software versions on all machines:
> Spark: 0.8.1  (md5: 5d3c56eaf91c7349886d5c70439730b3)
> Mesos: 0.13.0  (md5: 220dc9c1db118bc7599d45631da578b9)
> Python 2.7.3 (Stackoverflow mentioned differing python versions may be
> to blame --- unless Spark or iPython is specifically invoking an older
> version under the hood mine are all the same).
> Ubuntu 12.0.4
>
> I've modified mesos-daemon.sh as follows:
> I had problems launching the cluster with mesos-start-cluster.sh and
> traced the problem to (what seemed to be) a bug in mesos-daemon.sh
> which used a "--conf" flag that mesos-slave and mesos-master didn't
> recognize.  I removed the flag and instead added code to read in
> environment variables from mesos-deploy-env.sh.
> mesos-start-cluster.sh then worked as advertised.
>
> In case it's helpful, I've attached several files as follows:
> *spark_full_output: output of ipython process where SparkContext was
> created
> *mesos-deploy-env.sh: mesos config file from slave (identical to
> master except for MESOS_MASTER)
> *spark-env.sh: spark config file
> *mesos-master.INFO: log file from mesos-master
> *mesos-master.WARNING: log file from mesos-master
> *mesos-daemon.sh: my modified version of mesos-daemon.sh
>
> In case anybody from Berkeley is so interested they want to interact
> with my deployment, my office is in Soda hall so that can definitely
> be arranged.  My apologies if anybody received a duplicate message; I
> encountered some delays and complication while joining the list.
>
> -Brad Miller
>


Re: Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi Ognen/Mayur,

Thanks for the reply; it is good to know how easy it is to set up Spark
on an AWS cluster.

My situation is a bit different from yours: our company already has a
cluster and it really doesn't make much sense not to use it. That is
why I have been "going through" this. I really wish there were some
tutorials teaching you how to set up a Spark cluster on a bare-metal CDH
cluster, or some way to tweak the CDH Spark distribution so it is up to date.

Ognen, of course it would be very helpful if you could 'history | grep
spark... ' and document the work that you have done, since you've already
made it work!

Bin



On Mon, Mar 3, 2014 at 2:06 PM, Ognen Duzlevski  wrote:

>  I should add that in this setup you really do not need to look for the
> printout of the master node's IP - you set it yourself a priori. If anyone
> is interested, let me know, I can write it all up so that people can follow
> some set of instructions. Who knows, maybe I can come up with a set of
> scripts to automate it all...
>
> Ognen
>
>
>
> On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
>
> I have a Standalone spark cluster running in an Amazon VPC that I set up
> by hand. All I did was provision the machines from a common AMI image (my
> underlying distribution is Ubuntu), I created a "sparkuser" on each machine
> and I have a /home/sparkuser/spark folder where I downloaded spark. I did
> this on the master only, I did sbt/sbt assembly and I set up the
> conf/spark-env.sh to point to the master which is an IP address (in my case
> 10.10.0.200, the port is the default 7077). I also set up the slaves file
> in the same subdirectory to have all 16 ip addresses of the worker nodes
> (in my case 10.10.0.201-216). After sbt/sbt assembly was done on master, I
> then did cd ~/; tar -czf spark.tgz spark/ and I copied the resulting tgz
> file to each worker using the same "sparkuser" account and unpacked the
> .tgz on each slave (this will effectively replicate everything from master
> to all slaves - you can script this so you don't do it by hand).
>
> Your AMI should have the distribution's version of Java and git installed
> by the way.
>
> All you have to do then is sparkuser@spark-master>
> spark/sbin/start-all.sh (for 0.9, in 0.8.1 it is spark/bin/start-all.sh)
> and it will all automagically start :)
>
> All my Amazon nodes come with 4x400 Gb of ephemeral space which I have set
> up into a 1.6TB RAID0 array on each node and I am pooling this into an HDFS
> filesystem which is operated by a namenode outside the spark cluster while
> all the datanodes are the same nodes as the spark workers. This enables
> replication and extremely fast access since ephemeral is much faster than
> EBS or anything else on Amazon (you can do even better with SSD drives on
> this setup but it will cost ya).
>
> If anyone is interested I can document our pipeline set up - I came up
> with it myself and do not have a clue as to what the industry standards are
> since I could not find any written instructions anywhere online about how
> to set up a whole data analytics pipeline from the point of ingestion to
> the point of analytics (people don't want to share their secrets? or am I
> just in the dark and incapable of using Google properly?). My requirement
> was that I wanted this to run within a VPC for added security and
> simplicity, the Amazon security groups get really old quickly. Added bonus
> is that you can use a VPN as an entry into the whole system and your
> cluster instantly becomes "local" to you in terms of IPs etc. I use OpenVPN
> since I don't like Cisco nor Juniper (the only two options Amazon provides
> for their VPN gateways).
>
> Ognen
>
>
> On 3/3/14, 1:00 PM, Bin Wang wrote:
>
> Hi there,
>
>  I have a CDH cluster set up, and I tried using the Spark parcel come
> with Cloudera Manager, but it turned out they even don't have the
> run-example shell command in the bin folder. Then I removed it from the
> cluster and cloned the incubator-spark into the name node of my cluster,
> and built from source there successfully with everything as default.
>
>  I ran a few examples and everything seems work fine in the local mode.
> Then I am thinking about scale it to my cluster, which is what the
> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all
> the datanodes to the slaves and think I should run Spark in the standalone
> mode.
>
>  Say I am trying to set up Spark in the standalone mode following this
> instruction:
> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
> However, it says "Once started, the master will print out a
> spark://HOST:PORT URL for itself, which you can use to connect workers to
> it, or pass as the "master" argument to SparkContext. You can also find
> this URL on the master's web UI, which is http://localhost:8080 by
> default."
>
>  After I started the master, there is no URL printed on the screen and
> neither the web UI is running.
> Here is the output:
>  [root@box 

Re: filter operation in pyspark

2014-03-03 Thread Mayur Rustagi
Could be a number of issues: maybe your CSV is not allowing map tasks to
be broken up, or the file is not process/node local. How many tasks are you
seeing in the Spark web UI for map & store data? Are all the nodes being used
when you look at the task level? Is the time taken by each task roughly equal,
or very skewed?
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Mar 3, 2014 at 4:13 PM, Mohit Singh  wrote:

> Hi,
>I have a csv file... (say "n" columns )
>
> I am trying to do a filter operation like:
>
> query = rdd.filter(lambda x:x[1] == "1234")
> query.take(20)
> Basically this would return me rows with that specific value?
> This manipulation is taking quite some time to execute.. (if i can
> compare.. maybe slower than hadoop operation..)
>
> I am seeing this on my console:
> 14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8,
> finish = 5234
> 14/03/03 16:13:03 INFO SparkContext: Job finished: take at :1, took
> 5.249082169 s
> 14/03/03 16:13:03 INFO SparkContext: Starting job: take at :1
> 14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at :1) with
> 1 output partitions (allowLocal=true)
> 14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at
> :1)
> 14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
> 14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition
> locally
> 14/03/03 16:13:03 INFO HadoopRDD: Input split:
> hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728
>
> Am I not doing this correctly?
>
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get it.
> There is no other secret of success."
> -Socrates
>


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Mayur Rustagi
+1


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Mar 3, 2014 at 4:10 PM, polkosity  wrote:

> Thats exciting!  Will be looking into that, thanks Andrew.
>
> Related topic, has anyone had any experience running Spark on Tachyon
> in-memory filesystem, and could offer their views on using it?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


filter operation in pyspark

2014-03-03 Thread Mohit Singh
Hi,
   I have a csv file... (say "n" columns )

I am trying to do a filter operation like:

query = rdd.filter(lambda x:x[1] == "1234")
query.take(20)
Basically, this should return the rows with that specific value?
This manipulation is taking quite some time to execute (if I can
compare, maybe slower than a Hadoop operation).

I am seeing this on my console:
14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8,
finish = 5234
14/03/03 16:13:03 INFO SparkContext: Job finished: take at :1, took
5.249082169 s
14/03/03 16:13:03 INFO SparkContext: Starting job: take at :1
14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at :1) with 1
output partitions (allowLocal=true)
14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at
:1)
14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition
locally
14/03/03 16:13:03 INFO HadoopRDD: Input split:
hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728

Am I not doing this correctly?


-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread polkosity
That's exciting!  Will be looking into that, thanks Andrew.

Related topic, has anyone had any experience running Spark on Tachyon
in-memory filesystem, and could offer their views on using it? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Andrew Ash
polkosity, have you seen the job server that Ooyala open sourced?  I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.

https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server


On Mon, Mar 3, 2014 at 3:30 PM, polkosity  wrote:

> We're thinking of creating a Spark job server with a REST API, which would
> enable us (as well as managing jobs) to re-use the spark context as you
> suggest.  Thanks Koert!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2263.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread polkosity
We're thinking of creating a Spark job server with a REST API, which would
enable us (as well as managing jobs) to re-use the spark context as you
suggest.  Thanks Koert!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2263.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Sandy Ryza
Are you running in yarn-standalone mode or yarn-client mode?  Also, what
YARN scheduler and what NodeManager heartbeat?


On Sun, Mar 2, 2014 at 9:41 PM, polkosity  wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark
> standalone
> mode has executors processing at capacity in under a second :)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Missing Spark URL after starting the master

2014-03-03 Thread Ognen Duzlevski
I should add that in this setup you really do not need to look for the 
printout of the master node's IP - you set it yourself a priori. If 
anyone is interested, let me know, I can write it all up so that people 
can follow some set of instructions. Who knows, maybe I can come up with 
a set of scripts to automate it all...


Ognen


On 3/3/14, 3:02 PM, Ognen Duzlevski wrote:
I have a Standalone spark cluster running in an Amazon VPC that I set 
up by hand. All I did was provision the machines from a common AMI 
image (my underlying distribution is Ubuntu), I created a "sparkuser" 
on each machine and I have a /home/sparkuser/spark folder where I 
downloaded spark. I did this on the master only, I did sbt/sbt assembly 
and I set up the conf/spark-env.sh to point to the master which is an 
IP address (in my case 10.10.0.200, the port is the default 7077). I 
also set up the slaves file in the same subdirectory to have all 16 ip 
addresses of the worker nodes (in my case 10.10.0.201-216). After 
sbt/sbt assembly was done on master, I then did cd ~/; tar -czf 
spark.tgz spark/ and I copied the resulting tgz file to each worker 
using the same "sparkuser" account and unpacked the .tgz on each slave 
(this will effectively replicate everything from master to all slaves 
- you can script this so you don't do it by hand).


Your AMI should have the distribution's version of Java and git 
installed by the way.


All you have to do then is sparkuser@spark-master> 
spark/sbin/start-all.sh (for 0.9, in 0.8.1 it is 
spark/bin/start-all.sh) and it will all automagically start :)


All my Amazon nodes come with 4x400 Gb of ephemeral space which I have 
set up into a 1.6TB RAID0 array on each node and I am pooling this 
into an HDFS filesystem which is operated by a namenode outside the 
spark cluster while all the datanodes are the same nodes as the spark 
workers. This enables replication and extremely fast access since 
ephemeral is much faster than EBS or anything else on Amazon (you can 
do even better with SSD drives on this setup but it will cost ya).


If anyone is interested I can document our pipeline set up - I came up 
with it myself and do not have a clue as to what the industry 
standards are since I could not find any written instructions anywhere 
online about how to set up a whole data analytics pipeline from the 
point of ingestion to the point of analytics (people don't want to 
share their secrets? or am I just in the dark and incapable of using 
Google properly?). My requirement was that I wanted this to run within 
a VPC for added security and simplicity, the Amazon security groups 
get really old quickly. Added bonus is that you can use a VPN as an 
entry into the whole system and your cluster instantly becomes "local" 
to you in terms of IPs etc. I use OpenVPN since I don't like Cisco nor 
Juniper (the only two options Amazon provides for their VPN gateways).


Ognen


On 3/3/14, 1:00 PM, Bin Wang wrote:

Hi there,

I have a CDH cluster set up, and I tried using the Spark parcel come 
with Cloudera Manager, but it turned out they even don't have the 
run-example shell command in the bin folder. Then I removed it from 
the cluster and cloned the incubator-spark into the name node of my 
cluster, and built from source there successfully with everything as 
default.


I ran a few examples and everything seems work fine in the local 
mode. Then I am thinking about scale it to my cluster, which is what 
the "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want 
to add all the datanodes to the slaves and think I should run Spark 
in the standalone mode.


Say I am trying to set up Spark in the standalone mode following this 
instruction:

https://spark.incubator.apache.org/docs/latest/spark-standalone.html
However, it says "Once started, the master will print out a 
|spark://HOST:PORT| URL for itself, which you can use to connect 
workers to it, or pass as the "master" argument to |SparkContext|. 
You can also find this URL on the master's web UI, which is 
http://localhost:8080  by default."


After I started the master, there is no URL printed on the screen and 
neither the web UI is running.

Here is the output:
[root@box incubator-spark]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to 
/root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out


First Question: am I even in the ballpark to run Spark in standalone 
mode if I try to fully utilize my cluster? I saw there are four ways 
to launch Spark on a cluster, AWS-EC2, Spark in standalone, Apache 
Mesos, Hadoop Yarn... which I guess standalone mode is the way to go?


Second Question: how to get the Spark URL of the cluster, why the 
output is not like what the instruction says?


Best regards,

Bin


--
Some people, when confronted with a problem, think "I know, I'll use regular 
expressions." Now they have two problems.
-- Jamie Zawinski

Re: Missing Spark URL after starting the master

2014-03-03 Thread Ognen Duzlevski
I have a Standalone spark cluster running in an Amazon VPC that I set up 
by hand. All I did was provision the machines from a common AMI image 
(my underlying distribution is Ubuntu), I created a "sparkuser" on each 
machine and I have a /home/sparkuser/spark folder where I downloaded 
spark. I did this on the master only, I did sbt/sbt assembly and I set 
up the conf/spark-env.sh to point to the master which is an IP address 
(in my case 10.10.0.200, the port is the default 7077). I also set up 
the slaves file in the same subdirectory to have all 16 ip addresses of 
the worker nodes (in my case 10.10.0.201-216). After sbt/sbt assembly 
was done on master, I then did cd ~/; tar -czf spark.tgz spark/ and I 
copied the resulting tgz file to each worker using the same "sparkuser" 
account and unpacked the .tgz on each slave (this will effectively 
replicate everything from master to all slaves - you can script this so 
you don't do it by hand).


Your AMI should have the distribution's version of Java and git 
installed by the way.


All you have to do then is sparkuser@spark-master> 
spark/sbin/start-all.sh (for 0.9, in 0.8.1 it is spark/bin/start-all.sh) 
and it will all automagically start :)
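
Once the master is up, pointing a driver at it is just a matter of using that 
known address (a minimal sketch; the IP, port and app name below are the 
illustrative values from my setup, not something Spark prints for you):

import org.apache.spark.SparkContext

// spark://HOST:PORT is the URL the standalone docs talk about; here we
// know it a priori because we chose the master's IP ourselves.
val sc = new SparkContext("spark://10.10.0.200:7077", "MyTestApp")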


All my Amazon nodes come with 4x400 Gb of ephemeral space which I have 
set up into a 1.6TB RAID0 array on each node and I am pooling this into 
an HDFS filesystem which is operated by a namenode outside the spark 
cluster while all the datanodes are the same nodes as the spark workers. 
This enables replication and extremely fast access since ephemeral is 
much faster than EBS or anything else on Amazon (you can do even better 
with SSD drives on this setup but it will cost ya).


If anyone is interested I can document our pipeline set up - I came up 
with it myself and do not have a clue as to what the industry standards 
are since I could not find any written instructions anywhere online 
about how to set up a whole data analytics pipeline from the point of 
ingestion to the point of analytics (people don't want to share their 
secrets? or am I just in the dark and incapable of using Google 
properly?). My requirement was that I wanted this to run within a VPC 
for added security and simplicity, the Amazon security groups get really 
old quickly. Added bonus is that you can use a VPN as an entry into the 
whole system and your cluster instantly becomes "local" to you in terms 
of IPs etc. I use OpenVPN since I don't like Cisco nor Juniper (the only 
two options Amazon provides for their VPN gateways).


Ognen


On 3/3/14, 1:00 PM, Bin Wang wrote:

Hi there,

I have a CDH cluster set up, and I tried using the Spark parcel come 
with Cloudera Manager, but it turned out they even don't have the 
run-example shell command in the bin folder. Then I removed it from 
the cluster and cloned the incubator-spark into the name node of my 
cluster, and built from source there successfully with everything as 
default.


I ran a few examples and everything seems work fine in the local mode. 
Then I am thinking about scale it to my cluster, which is what the 
"DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to 
add all the datanodes to the slaves and think I should run Spark in 
the standalone mode.


Say I am trying to set up Spark in the standalone mode following this 
instruction:

https://spark.incubator.apache.org/docs/latest/spark-standalone.html
However, it says "Once started, the master will print out a 
|spark://HOST:PORT| URL for itself, which you can use to connect 
workers to it, or pass as the "master" argument to |SparkContext|. You 
can also find this URL on the master's web UI, which is 
http://localhost:8080  by default."


After I started the master, there is no URL printed on the screen and 
neither the web UI is running.

Here is the output:
[root@box incubator-spark]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to 
/root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out


First Question: am I even in the ballpark to run Spark in standalone 
mode if I try to fully utilize my cluster? I saw there are four ways 
to launch Spark on a cluster, AWS-EC2, Spark in standalone, Apache 
Mesos, Hadoop Yarn... which I guess standalone mode is the way to go?


Second Question: how to get the Spark URL of the cluster, why the 
output is not like what the instruction says?


Best regards,

Bin


--
Some people, when confronted with a problem, think "I know, I'll use regular 
expressions." Now they have two problems.
-- Jamie Zawinski



Re: Missing Spark URL after starting the master

2014-03-03 Thread Mayur Rustagi
I think you have been through enough :).
Basically you have to download spark-ec2 scripts & run them. It'll just
need your amazon secret key & access key, start your cluster, install
everything, create security groups & give you the url, just login & go
ahead...

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Mon, Mar 3, 2014 at 11:00 AM, Bin Wang  wrote:

> Hi there,
>
> I have a CDH cluster set up, and I tried using the Spark parcel come with
> Cloudera Manager, but it turned out they even don't have the run-example
> shell command in the bin folder. Then I removed it from the cluster and
> cloned the incubator-spark into the name node of my cluster, and built from
> source there successfully with everything as default.
>
> I ran a few examples and everything seems work fine in the local mode.
> Then I am thinking about scale it to my cluster, which is what the
> "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all
> the datanodes to the slaves and think I should run Spark in the standalone
> mode.
>
> Say I am trying to set up Spark in the standalone mode following this
> instruction:
> https://spark.incubator.apache.org/docs/latest/spark-standalone.html
> However, it says "Once started, the master will print out a
> spark://HOST:PORT URL for itself, which you can use to connect workers to
> it, or pass as the “master” argument to SparkContext. You can also find
> this URL on the master’s web UI, which is http://localhost:8080 by
> default."
>
> After I started the master, there is no URL printed on the screen and
> neither the web UI is running.
> Here is the output:
> [root@box incubator-spark]# ./sbin/start-master.sh
> starting org.apache.spark.deploy.master.Master, logging to
> /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
>
> First Question: am I even in the ballpark to run Spark in standalone mode
> if I try to fully utilize my cluster? I saw there are four ways to launch
> Spark on a cluster, AWS-EC2, Spark in standalone, Apache Mesos, Hadoop
> Yarn... which I guess standalone mode is the way to go?
>
> Second Question: how to get the Spark URL of the cluster, why the output
> is not like what the instruction says?
>
> Best regards,
>
> Bin
>


o.a.s.u.Vector instances for equality

2014-03-03 Thread Oleksandr Olgashko
Hello. What is the best way to check two Vectors for equality?

val a = new Vector(Array(1))
val b = new Vector(Array(1))
println(a == b)
// false


Missing Spark URL after starting the master

2014-03-03 Thread Bin Wang
Hi there,

I have a CDH cluster set up, and I tried using the Spark parcel that comes with
Cloudera Manager, but it turned out it doesn't even have the run-example
shell command in the bin folder. Then I removed it from the cluster and
cloned incubator-spark onto the name node of my cluster, and built from
source there successfully with everything as default.

I ran a few examples and everything seems to work fine in local mode. Then
I am thinking about scaling it to my cluster, which is what the "DISTRIBUTE +
ACTIVATE" command does in Cloudera Manager. I want to add all the datanodes
to the slaves and think I should run Spark in standalone mode.

Say I am trying to set up Spark in the standalone mode following this
instruction:
https://spark.incubator.apache.org/docs/latest/spark-standalone.html
However, it says "Once started, the master will print out a
spark://HOST:PORT URL for itself, which you can use to connect workers to
it, or pass as the "master" argument to SparkContext. You can also find
this URL on the master's web UI, which is http://localhost:8080 by default."

After I started the master, there is no URL printed on the screen, nor is the
web UI running.
Here is the output:
[root@box incubator-spark]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to
/root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out

First Question: am I even in the ballpark to run Spark in standalone mode
if I try to fully utilize my cluster? I saw there are four ways to launch
Spark on a cluster: AWS EC2, Spark standalone, Apache Mesos, Hadoop
YARN... which I guess standalone mode is the way to go?

Second Question: how do I get the Spark URL of the cluster, and why is the
output not like what the instructions say?

Best regards,

Bin


pyspark crash on mesos

2014-03-03 Thread bmiller1
Hi All,

After switching from standalone Spark to Mesos I'm experiencing some
instability.  I'm running pyspark interactively through iPython notebook,
and get this crash non-deterministically (although pretty reliably in the
first 2000 tasks, often much sooner).

Exception in thread "DAGScheduler" org.apache.spark.SparkException: EOF
reached before Python server acknowledged
at
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:340)
at
org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
at
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
at
org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at 
scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:662)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:437)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
at
org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

I'm running the following software versions on all machines:
Spark: 0.8.1  (md5: 5d3c56eaf91c7349886d5c70439730b3)
Mesos: 0.13.0  (md5: 220dc9c1db118bc7599d45631da578b9)
Python 2.7.3 (Stack Overflow mentioned differing Python versions may be to
blame --- unless Spark or IPython is specifically invoking an older version
under the hood, mine are all the same).
Ubuntu 12.0.4

I've modified mesos-daemon.sh as follows:
I had problems launching the cluster with mesos-start-cluster.sh and traced
the problem to (what seemed to be) a bug in mesos-daemon.sh which used a
"--conf" flag that mesos-slave and mesos-master didn't recognize.  I removed
the flag and instead added code to read in environment variables from
mesos-deploy-env.sh.  mesos-start-cluster.sh then worked as advertised.

In case it's helpful, I've included several files as follows:
* spark_full_output: output of ipython process where SparkContext was created
* mesos-deploy-env.sh: mesos config file from slave (identical to master except for MESOS_MASTER)
* spark-env.sh: spark config file
* mesos-master.INFO: log file from mesos-master
* mesos-master.WARNING: log file from mesos-master
* mesos-daemon.sh: my modified version of mesos-daemon.sh

In case anybody from Berkeley is so interested they want to interact with my
deployment, my office is in Soda Hall, so that can definitely be arranged.

-Brad Miller



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-crash-on-mesos-tp2255.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Blog : Why Apache Spark is a Crossover Hit for Data Scientists

2014-03-03 Thread Sean Owen
I put together a little opinion piece on why Spark is cool for data
science. There is, I think, a nice example of using ALS with Stack
Overflow tags in here too. Hope Spark folks out there might enjoy...

http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/

Sean


Re: Job initialization performance of Spark standalone mode vs YARN

2014-03-03 Thread Koert Kuipers
If you need quick response times, re-use your Spark context between queries
and cache RDDs in memory.
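
Something along these lines (a minimal sketch; sc is a long-lived SparkContext
and the HDFS path is illustrative):

// Build the RDD once and keep it cached across queries.
val events = sc.textFile("hdfs://namenode:9000/events").cache()

// Later "queries" against the same context hit the in-memory data.
val errorCount = events.filter(_.contains("ERROR")).count()
val warnCount  = events.filter(_.contains("WARN")).count()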
On Mar 3, 2014 12:42 AM, "polkosity"  wrote:

> Thanks for the advice Mayur.
>
> I thought I'd report back on the performance difference...  Spark
> standalone
> mode has executors processing at capacity in under a second :)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Beginners Hadoop question

2014-03-03 Thread goi cto
Thanks. I will try it!


On Mon, Mar 3, 2014 at 1:19 PM, Alonso Isidoro Roman wrote:

> Hi, i am a beginner too, but as i have learned, hadoop works better with
> big files, at least with 64MB, 128MB or even more. I think you need to
> aggregate all the files into a new big one. Then you must copy to HDFS
> using this command:
>
> hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
>
> hadoop just copy MYFILE into hadoop distributed file system.
>
> Can i recommend you what i have done? go to BigDataUniversity.com and take
> the Hadoop Fundamentals I course. It is free and very well documented.
>
> Regards
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
>
> 2014-03-03 12:10 GMT+01:00 goi cto :
>
> Hi,
>>
>> I am sorry for the beginners question but...
>> I have a spark java code which reads a file (c:\my-input.csv) process it
>> and writes an output file (my-output.csv)
>> Now I want to run it on Hadoop in a distributed environment
>> 1) My input file should be one big file or separate smaller files?
>> 2) if we are using smaller files, how does my code needs to change to
>> process all of the input files?
>>
>> Will Hadoop just copy the files to different servers or will it also
>> split their content among servers?
>>
>> Any example will be great!
>> --
>> Eran | CTO
>>
>
>


-- 
Eran | CTO


Re: Beginners Hadoop question

2014-03-03 Thread Mohit Singh
Not sure whether I understand your question correctly. If you are trying to
use Hadoop (as in the MapReduce programming model), then you would basically
have to use the Hadoop APIs in your program.
But if you have data stored in HDFS and you want to use Spark to process
that data, then just specify the input path as spark.textFile("hdfs://...")

Take a look at these examples:
http://spark.incubator.apache.org/examples.html
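
For example, something like this (a sketch; the master URL and HDFS paths are
illustrative, and the filter stands in for your real processing):

import org.apache.spark.SparkContext

val spark = new SparkContext("local[4]", "CsvJob")  // or your cluster's master URL

// Read the CSV from HDFS, process it, and write the result back to HDFS.
val lines  = spark.textFile("hdfs://namenode:9000/user/me/my-input.csv")
val output = lines.filter(_.nonEmpty)
output.saveAsTextFile("hdfs://namenode:9000/user/me/my-output")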



On Mon, Mar 3, 2014 at 3:19 AM, Alonso Isidoro Roman wrote:

> Hi, i am a beginner too, but as i have learned, hadoop works better with
> big files, at least with 64MB, 128MB or even more. I think you need to
> aggregate all the files into a new big one. Then you must copy to HDFS
> using this command:
>
> hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
>
> hadoop just copy MYFILE into hadoop distributed file system.
>
> Can i recommend you what i have done? go to BigDataUniversity.com and take
> the Hadoop Fundamentals I course. It is free and very well documented.
>
> Regards
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then programming
> must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
>
> 2014-03-03 12:10 GMT+01:00 goi cto :
>
> Hi,
>>
>> I am sorry for the beginners question but...
>> I have a spark java code which reads a file (c:\my-input.csv) process it
>> and writes an output file (my-output.csv)
>> Now I want to run it on Hadoop in a distributed environment
>> 1) My input file should be one big file or separate smaller files?
>> 2) if we are using smaller files, how does my code needs to change to
>> process all of the input files?
>>
>> Will Hadoop just copy the files to different servers or will it also
>> split their content among servers?
>>
>> Any example will be great!
>> --
>> Eran | CTO
>>
>
>


-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates


Re: Beginners Hadoop question

2014-03-03 Thread Alonso Isidoro Roman
Hi, I am a beginner too, but as I have learned, Hadoop works better with
big files, at least 64 MB, 128 MB or even more. I think you need to
aggregate all the files into one new big file. Then you must copy it to HDFS
using this command:

hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE

hadoop just copies MYFILE into the Hadoop distributed file system.

Can I recommend what I have done? Go to BigDataUniversity.com and take
the Hadoop Fundamentals I course. It is free and very well documented.

Regards

Alonso Isidoro Roman.

Mis citas preferidas (de hoy) :
"Si depurar es el proceso de quitar los errores de software, entonces
programar debe ser el proceso de introducirlos..."
 -  Edsger Dijkstra

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting ..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"



2014-03-03 12:10 GMT+01:00 goi cto :

> Hi,
>
> I am sorry for the beginners question but...
> I have a spark java code which reads a file (c:\my-input.csv) process it
> and writes an output file (my-output.csv)
> Now I want to run it on Hadoop in a distributed environment
> 1) My input file should be one big file or separate smaller files?
> 2) if we are using smaller files, how does my code needs to change to
> process all of the input files?
>
> Will Hadoop just copy the files to different servers or will it also split
> their content among servers?
>
> Any example will be great!
> --
> Eran | CTO
>


Beginners Hadoop question

2014-03-03 Thread goi cto
Hi,

I am sorry for the beginners question but...
I have a spark java code which reads a file (c:\my-input.csv) process it
and writes an output file (my-output.csv)
Now I want to run it on Hadoop in a distributed environment
1) Should my input file be one big file or separate smaller files?
2) If we are using smaller files, how does my code need to change to
process all of the input files?

Will Hadoop just copy the files to different servers or will it also split
their content among servers?

Any example will be great!
-- 
Eran | CTO


Problem with "delete spark temp dir" on spark 0.8.1

2014-03-03 Thread goi cto
Hi,

I am running a Spark Java program on a local machine. When I try to write
the output to a file (RDD.saveAsTextFile) I am getting this exception:

Exception in thread "Delete Spark temp dir ..."

This is running on my local Windows machine.

Any ideas?

-- 
Eran | CTO


Error: Could not find or load main class org.apache.spark.repl.Main on GitBash

2014-03-03 Thread goi cto
Hi,

I am trying to run spark-shell from Git Bash on Windows with Spark 0.9.
I am getting "*Error: Could not find or load main class
org.apache.spark.repl.Main*"

I tried running sbt/sbt clean assembly, which completed successfully, but the
problem still exists.

Any other ideas?
Which path variables should I make sure I have?

-- 
Eran | CTO


Re: OutOfMemoryError when loading input file

2014-03-03 Thread Yonathan Perez
Thanks for your answer yxzhao, but setting SPARK_MEM doesn't solve the
problem.
I also understand that setting SPARK_MEM is the same as calling
SparkConf.set("spark.executor.memory", ...), which I do.

Any additional advice would be highly appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-when-loading-input-file-tp2213p2246.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.