Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread doctorx
Hi,
I am facing the same issue, with the following error:

ERROR Master:75 - Leadership has been revoked -- master shutting down.

Can anybody help? Any clue would be useful. Should I change something in the
Spark cluster or in ZooKeeper? Is there any setting in Spark that can help?

Thanks & Regards
Raghvendra
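
(For context: standalone-master HA with ZooKeeper is normally enabled through
the recovery properties below, passed via SPARK_DAEMON_JAVA_OPTS in
conf/spark-env.sh on every master. This is only a sketch; the ZooKeeper hosts
and the znode path are placeholders, not values taken from this thread.)

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"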



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Calling SparkContext methods in scala Future

2016-01-18 Thread Marco
Hello,

I am using Spark 1.5.1 within SBT, with Scala 2.10.6, and I am facing an
issue with the SparkContext.

Basically, I have an object that needs to do several things:

- call an external service One (web api)
- call an external service Two (another api)
- read and produce an RDD from HDFS (Spark)
- parallelize the data obtained in the first two calls
- join these different rdds, do stuff with them...

Now, I am trying to do it in an asynchronous way. This doesn't seem to
work, though. My guess is that Spark doesn't see the calls to .parallelize,
as they are made in different tasks (or Futures), so this code runs
earlier/later than expected, or maybe with an unset context (can that be?). I
have tried different approaches, one of them being a call to SparkEnv.set
inside the flatMap and map calls (in the Future). However, all I get is
"Cannot call methods on a stopped SparkContext". It just doesn't work - maybe
I just misunderstood what it does, so I removed it.

This is the code I have written so far:

object Fetcher {

  def fetch(name, master, ...) = {
val externalCallOne: Future[WSResponse] = externalService1()
val externalCallTwo: Future[String] = externalService2()
// val sparkEnv = SparkEnv.get
val config = new SparkConf()
.setAppName(name)
.set("spark.master", master)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sparkContext = new SparkContext(config)
//val sparkEnv = SparkEnv.get

val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
  // SparkEnv.set(sparkEnv)
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)
val rddTwo = sparkContext.parallelize(dataTwo)
// do stuff here ... foreach/println, and

val joinedData = rddOne leftOuterJoin (rddTwo)
  }
}
eventuallyJoinedData onSuccess { case success => ...  }
eventuallyJoinedData onFailure { case error =>
println(error.getMessage) }
// sparkContext.stop
  }

}
As you can see, I have also tried commenting out the line that stops the
context, but then I get another issue:

13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext
- Starting job: count at Fetcher.scala:38
13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
Selector.select() returned prematurely because
Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner
- Error in cleaning thread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
~[na:1.8.0_65]
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
[spark-core_2.10-1.5.1.jar:1.5.1]
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.940 [db-async-netty-thread-1] DEBUG
io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
because Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
~[na:1.8.0_65]
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
~[na:1.8.0_65]
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
~[na:1.8.0_65]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping org.spark-project.jetty.server.Server@787cbcef
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping SelectChannelConnector@0.0.0.0:4040
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping
org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
As you can see, it tries to call the count 

Re: SparkContext SyntaxError: invalid syntax

2016-01-18 Thread Andrew Weiner
Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.
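
(Illustration only: in conf/spark-defaults.conf that property takes the form
below; apart from PYSPARK_PYTHON, the variable names and paths are made-up
examples, not values from this thread.)

spark.yarn.appMasterEnv.PYSPARK_PYTHON     /opt/python/bin/python
spark.yarn.appMasterEnv.SOME_OTHER_VAR     some-value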

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung 
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> --
> From: andrewweiner2...@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutl...@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> 
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler  wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home//local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home//local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh 
> altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first 
> submit the job, but at some point during the job, my environment variables 
> are thrown out and someone's (yarn's?) environment variables are being used.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 

Calling SparkContext methods in scala Future

2016-01-18 Thread makronized
I am using Spark 1.5.1 within SBT, with Scala 2.10.6, and I am writing because
I am facing an issue with the SparkContext.

Basically, I have an object that needs to do several things:

- call an external service One (web api)
- call an external service Two (another api)
- read and produce an RDD from HDFS (Spark)
- parallelize the data obtained in the first two calls
- join these different rdds, do stuff with them...

Now, I am trying to do it in an asynchronous way. This doesn't seem to work,
though. My guess is that Spark doesn't see the calls to .parallelize, as they
are made in different tasks (or Futures), so this code runs earlier/later than
expected, or maybe with an unset context (can that be?). I have tried
different approaches, one of them being a call to SparkEnv.set inside the
flatMap and map calls (in the Future). However, all I get is "Cannot call
methods on a stopped SparkContext". It just doesn't work - maybe I just
misunderstood what it does, so I removed it.

This is the code I have written so far:

object Fetcher { 

  def fetch(name, master, ...) = { 
val externalCallOne: Future[WSResponse] = externalService1()
val externalCallTwo: Future[String] = externalService2()
// val sparkEnv = SparkEnv.get
val config = new SparkConf()
.setAppName(name)
.set("spark.master", master)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sparkContext = new SparkContext(config)
//val sparkEnv = SparkEnv.get

val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
  // SparkEnv.set(sparkEnv)
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)
val rddTwo = sparkContext.parallelize(dataTwo)
// do stuff here ... foreach/println, and 

val joinedData = rddOne leftOuterJoin (rddTwo)
  }
} 
eventuallyJoinedData onSuccess { case success => ...  }
eventuallyJoinedData onFailure { case error => println(error.getMessage)
} 
// sparkContext.stop 
  } 

}
As you can see, I have also tried commenting out the line that stops the
context, but then I get another issue:

13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext -
Starting job: count at Fetcher.scala:38
13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
Selector.select() returned prematurely because
Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner -
Error in cleaning thread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
~[na:1.8.0_65]
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.940 [db-async-netty-thread-1] DEBUG
io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
because Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils - uncaught
error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
~[na:1.8.0_65]
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
~[na:1.8.0_65]
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
~[na:1.8.0_65]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping org.spark-project.jetty.server.Server@787cbcef
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping SelectChannelConnector@0.0.0.0:4040
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping
org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
As you can see, it 

Re: Using JDBC clients with "Spark on Hive"

2016-01-18 Thread Ricardo Paiva
Are you running the Spark Thrift JDBC/ODBC server?

In my environment I have a Hive Metastore server and the Spark Thrift
Server pointing to the Hive Metastore.

I use the Hive beeline tool for testing. With this setup I'm able to use
Tableau connecting to Hive tables and using Spark SQL as the engine.
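
(For anyone reproducing this kind of setup, a minimal sketch looks like the
commands below; the master, host and port are placeholders, and the Thrift
server typically picks up the Hive metastore from the hive-site.xml on its
classpath.)

  ./sbin/start-thriftserver.sh --master yarn \
    --hiveconf hive.server2.thrift.port=10000
  beeline -u jdbc:hive2://thrift-server-host:10000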

Regards,

Ricardo


On Thu, Jan 14, 2016 at 11:15 PM, sdevashis [via Apache Spark User List] <
ml-node+s1001560n25976...@n3.nabble.com> wrote:

> Hello Experts,
>
> I am getting started with Hive with Spark as the query engine. I built the
> package from sources. I am able to invoke Hive CLI and run queries and see
> in Ambari that Spark application are being created confirming hive is using
> Spark as the engine.
>
> However other than Hive CLI, I am not able to run queries from any other
> clients that use the JDBC to connect to hive through thrift. I tried
> Squirrel, Aginity Netezza workbench, and even Hue.
>
> No yarn applications are getting created, the query times out after
> sometime. Nothing gets into /tmp/user/hive.log. Am I missing something?
>
> Again I am using Hive on Spark and not spark SQL.
>
> Version Info:
> Spark 1.4.1 built for Hadoop 2.4
>
>
> Thank you in advance for any pointers.
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-JDBC-clients-with-Spark-on-Hive-tp25976.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-JDBC-clients-with-Spark-on-Hive-tp25976p25988.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming: BatchDuration and Processing time

2016-01-18 Thread Ricardo Paiva
If you are using Kafka as the message queue, Spark will still process the
time slices accordingly, even if it is running late, as in your example. But
it will eventually fail, because your process will end up asking for a
message that is older than the oldest message retained in Kafka.
If your processing takes longer than the batch interval, say at your system's
peak time during the day, but takes much less time at night when the system
is mostly idle, the streaming job will keep working and process everything
correctly (though it's risky if the late time slices don't finish during the
idle time).

The best thing to do is to optimize your job so that it fits within the batch
interval and avoid falling behind. :)
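
(A small sketch of the knobs involved, assuming a 1-second batch interval;
spark.streaming.backpressure.enabled is available from Spark 1.5 and simply
throttles the ingestion rate when batches start to fall behind, it does not
change the semantics described above.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-example")
  // slow down intake when processing time exceeds the batch interval
  .set("spark.streaming.backpressure.enabled", "true")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batch duration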

Regards,

Ricardo





On Sun, Jan 17, 2016 at 2:32 PM, pyspark2555 [via Apache Spark User List] <
ml-node+s1001560n25986...@n3.nabble.com> wrote:

> Hi,
>
> If BatchDuration is set to 1 second in StreamingContext and the actual
> processing time is longer than one second, then how does Spark handle that?
>
> For example, I am receiving a continuous Input stream. Every 1 second
> (batch duration), the RDDs will be processed. What if this processing time
> is longer than 1 second? What happens in the next batch duration?
>
> Thanks.
> Amit
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-BatchDuration-and-Processing-time-tp25986.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-BatchDuration-and-Processing-time-tp25986p25989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-18 Thread Ricardo Paiva
I don't know if this is the most efficient way to do it, but you can use
a sliding window that is bigger than your aggregation period and filter
only the messages that fall inside the period.

Remember that to work with reduceByKeyAndWindow you need to associate
each row with a time key, in your case "MMddhhmm".

http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
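
(A rough sketch of that idea; the input DStream, the key format and the
window sizes below are assumptions for illustration, not code from this
thread.)

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Minutes

// assumed input: events: DStream[(String, Long)] of (payload, epoch millis)
val keyed = events.map { case (payload, ts) =>
  // bucket every record by its wall-clock minute, e.g. "201601180505"
  (new SimpleDateFormat("yyyyMMddHHmm").format(new Date(ts)), 1L)
}
// window larger than the 5-minute aggregation period, sliding every 5 minutes;
// downstream, keep only the buckets whose key falls inside the period you want
val counts = keyed.reduceByKeyAndWindow((a: Long, b: Long) => a + b,
  Minutes(10), Minutes(5))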

Hope it helps,

Regards,

Ricardo



On Sat, Jan 16, 2016 at 1:13 AM, ffarozan [via Apache Spark User List] <
ml-node+s1001560n25982...@n3.nabble.com> wrote:

> I am implementing aggregation using spark streaming and kafka. My batch
> and window size are same. And the aggregated data is persisted in
> Cassandra.
>
> I want to aggregate for fixed time windows - 5:00, 5:05, 5:10, ...
>
> But we cannot control when to run streaming job, we only get to specify
> the batch interval.
>
> So the problem is - lets say if streaming job starts at 5:02, then I will
> get results at 5:07, 5:12, etc. and not what I want.
>
> Any suggestions?
>
> thanks,
> Firdousi
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982p25990.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
Hi,

How do you know it doesn't work? The log looks roughly normal to me. Is
Spark not running at the printed address? Can you not start jobs?

On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets 
wrote:

> Hi ,
>    I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.
>
> It doesn't work properly - I got exceptions, and at the end a standalone Spark
> cluster was installed.
>

The purpose of the script is to install a standalone Spark cluster. So
that's not an error :).


> here is log information:
>
> Any suggestions?
>
> Thanks
> Oleg.
>
> oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
> --key-pair=CC-ES-Demo
>  
> --identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
> --region=us-east-1 --zone=us-east-1a --spot-price=0.05   -s 5
> --spark-version=1.6.0   launch entity-extraction-spark-cluster
> Setting up security groups...
> Searching for existing cluster entity-extraction-spark-cluster in region
> us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Requesting 5 slaves as spot instances with price $0.050
> Waiting for spot instances to be granted...
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> All 5 slaves granted
> Launched master in us-east-1a, regid = r-9384033f
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state..
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
> Cluster is now in 'ssh-ready' state. Waited 442 seconds.
> Generating cluster's SSH key on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Transferring cluster's SSH key to slaves...
> ec2-54-165-243-74.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
> (ECDSA) to the list of known hosts.
> ec2-54-88-245-107.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
> (ECDSA) to the list of known hosts.
> ec2-54-172-29-47.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
> (ECDSA) to the list of known hosts.
> ec2-54-165-131-210.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
> (ECDSA) to the list of known hosts.
> ec2-54-172-46-184.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
> (ECDSA) to the list of known hosts.
> Cloning spark-ec2 scripts from
> https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Cloning into 'spark-ec2'...
> remote: Counting objects: 2068, done.
> remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
> Receiving objects: 100% (2068/2068), 349.76 KiB, done.
> Resolving deltas: 100% (796/796), done.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Deploying files to master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> sending incremental file list
> root/spark-ec2/ec2-variables.sh
>
> sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
> total size is 1,684  speedup is 0.90
> Running setup on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Setting up Spark on 

Re: Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread Ted Yu
Please refer to the following suites:

yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
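
(Hedged side note: inside Spark's own test suites a multi-JVM "mini cluster"
is usually obtained with the local-cluster master URL, roughly as below. It is
a test facility rather than a documented public API, and it needs a local
Spark build to launch the worker JVMs, so treat this only as a sketch.)

import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
val sc = new SparkContext(
  new SparkConf().setMaster("local-cluster[2,1,1024]").setAppName("mini-cluster-test"))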

Cheers

On Mon, Jan 18, 2016 at 2:14 AM, zml张明磊  wrote:

> Hello,
>
>
>
>I want to find some test files in Spark that provide the same kind of
> functionality as the Hadoop MiniCluster test environment, but I cannot
> find them. Does anyone know about that?
>


Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread Ted Yu
Can you pastebin master log before the error showed up ?

The initial message was posted for Spark 1.2.0.
Which release of Spark / ZooKeeper do you use?

Thanks

On Mon, Jan 18, 2016 at 6:47 AM, doctorx  wrote:

> Hi,
> I am facing the same issue, with the following error:
>
> ERROR Master:75 - Leadership has been revoked -- master shutting down.
>
> Can anybody help? Any clue would be useful. Should I change something in the
> Spark cluster or in ZooKeeper? Is there any setting in Spark that can help
> me?
>
> Thanks & Regards
> Raghvendra
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark ml Dataframe vs Labeled Point RDD Mllib speed

2016-01-18 Thread jarias
Hi all,

I've recently been playing with the ml API in Spark 1.6.0, as I'm in the
process of implementing a series of new classifiers for my PhD thesis. There
are some questions that have arisen regarding the scalability of the
different data pipelines that can be used to load the training datasets into
the ML algorithms.

I used to load the data into spark by creating a raw RDD from a text file
(csv data points) and parse them directly into LabeledPoints to be used with
the learning algorithms.

Right now, I see that the spark.ml API is built on top of DataFrames, so
I've changed my scripts to load data into this data structure using the
Databricks CSV connector. However, this method seems to be 2 to 4 times
slower than the original one. I can understand that creating a DataFrame can
be costly for the system, even when using a pre-specified schema. In
addition, it seems that working with the "Row" class is also slower than
accessing the data inside LabeledPoints.

My last point is that all the proposed algorithms seem to work with
DataFrames in which the rows are composed of  Rows, which implies an
additional parsing pass through the data when loading from a text file.
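
(To make the comparison concrete, a sketch of the two loading paths being
contrasted, as run from a spark-shell session; the file name, schema and
column layout are assumptions, and the DataFrame path relies on the external
spark-csv package mentioned above.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// RDD path: parse each CSV line straight into a LabeledPoint (label first)
val labeled = sc.textFile("train.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}

// DataFrame path: load through the Databricks CSV connector with a
// pre-specified schema (two features here, purely as an example)
val trainSchema = StructType(Seq(
  StructField("label", DoubleType),
  StructField("f0", DoubleType),
  StructField("f1", DoubleType)))
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(trainSchema)
  .load("train.csv")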

I'm sorry if I'm just mixing up some concepts from the documentation, but
after intensive experimentation I don't really see a clear strategy for
using these different elements. Any thoughts would be really appreciated :)

Cheers,

jarias



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ml-Dataframe-vs-Labeled-Point-RDD-Mllib-speed-tp25995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-18 Thread Siddharth Ubale
Hi,

Thanks for pointing out the Phoenix discrepancy! I was using a Phoenix 4.4 jar
built for the HBase 1.1 release, whereas I was actually running HBase 0.98. I
have fixed that issue.
However, I am still unable to get the streaming job going in cluster mode; it
fails with the following trace:

Application application_1452763526769_0021 failed 2 times due to AM Container 
for appattempt_1452763526769_0021_02 exited with exitCode: -1000
For more detailed output, check application tracking 
page:http://slave1:8088/proxy/application_1452763526769_0021/Then, click on 
links to logs of each attempt.
Diagnostics: File does not exist: 
hdfs://slave1:9000/user/hduser/.sparkStaging/application_1452763526769_0021/__hadoop_conf__7080838197861423764.zip
java.io.FileNotFoundException: File does not exist: 
hdfs://slave1:9000/user/hduser/.sparkStaging/application_1452763526769_0021/__hadoop_conf__7080838197861423764.zip
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Failing this attempt. Failing the application.


I am not sure what this _Hadoop_conf-XXX.zip file is. Can someone please guide me?
I am submitting a Spark Streaming job which reads from a Kafka topic and dumps
data into HBase tables via the Phoenix API. The job behaves as expected in local
mode.


Thanks
Siddharth Ubale

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 15, 2016 8:08 PM
To: Siddharth Ubale 
Cc: user@spark.apache.org
Subject: Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

Interesting. Which hbase / Phoenix releases are you using ?
The following method has been removed from Put:

   public Put setWriteToWAL(boolean write) {

Please make sure the Phoenix release is compatible with your hbase version.

Cheers

On Fri, Jan 15, 2016 at 6:20 AM, Siddharth Ubale 
> wrote:
Hi,


This is the log from the application :

16/01/15 19:23:19 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with SUCCEEDED (diag message: Shutdown hook called before final status was 
reported.)
16/01/15 19:23:19 INFO yarn.ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1452763526769_0011
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote 
daemon shut down; proceeding with flushing remote transports.
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.
16/01/15 19:23:19 INFO util.Utils: Shutdown hook called
16/01/15 19:23:19 INFO client.HConnectionManager$HConnectionImplementation: 
Closing zookeeper sessionid=0x1523f753f6f0061
16/01/15 19:23:19 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/15 19:23:19 INFO zookeeper.ZooKeeper: Session: 0x1523f753f6f0061 closed
16/01/15 19:23:19 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.NoSuchMethodError: 
org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
java.lang.NoSuchMethodError: 
org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
at 
org.apache.phoenix.schema.PTableImpl$PRowImpl.newMutations(PTableImpl.java:639)
at 
org.apache.phoenix.schema.PTableImpl$PRowImpl.<init>(PTableImpl.java:632)
at 
org.apache.phoenix.schema.PTableImpl.newRow(PTableImpl.java:557)
at 
org.apache.phoenix.schema.PTableImpl.newRow(PTableImpl.java:573)
at 
org.apache.phoenix.execute.MutationState.addRowMutations(MutationState.java:185)
at 

Re: Error in Spark Executors when trying to read HBase table from Spark with Kerberos enabled

2016-01-18 Thread Vinay Kashyap
Hi Guys,

Any help regarding this issue..??



On Wed, Jan 13, 2016 at 6:39 PM, Vinay Kashyap  wrote:

> Hi all,
>
> I am using  *Spark 1.5.1 in YARN cluster mode in CDH 5.5.*
> I am trying to create an RDD by reading HBase table with kerberos enabled.
> I am able to launch the spark job to read the HBase table, but I notice
> that the executors launched for the job cannot proceed due to an issue with
> Kerberos and they are stuck indefinitely.
>
> Below is my code to read a HBase table.
>
>
> Configuration configuration = HBaseConfiguration.create();
>   configuration.set(TableInputFormat.INPUT_TABLE,
> frameStorage.getHbaseStorage().getTableId());
>   String hbaseKerberosUser = "sparkUser";
>   String hbaseKerberosKeytab = "";
>   if (!hbaseKerberosUser.trim().isEmpty() &&
> !hbaseKerberosKeytab.trim().isEmpty()) {
> configuration.set("hadoop.security.authentication", "kerberos");
> configuration.set("hbase.security.authentication", "kerberos");
> configuration.set("hbase.security.authorization", "true");
> configuration.set("hbase.rpc.protection", "authentication");
> configuration.set("hbase.master.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.regionserver.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.rest.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.thrift.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.master.keytab.file",
> hbaseKerberosKeytab);
> configuration.set("hbase.regionserver.keytab.file",
> hbaseKerberosKeytab);
> configuration.set("hbase.rest.authentication.kerberos.keytab",
> hbaseKerberosKeytab);
> configuration.set("hbase.thrift.keytab.file",
> hbaseKerberosKeytab);
> UserGroupInformation.setConfiguration(configuration);
> if (UserGroupInformation.isSecurityEnabled()) {
>   UserGroupInformation ugi = UserGroupInformation
>   .loginUserFromKeytabAndReturnUGI(hbaseKerberosUser,
> hbaseKerberosKeytab);
>   TokenUtil.obtainAndCacheToken(configuration, ugi);
> }
>   }
>
>   System.out.println("loading HBase Table RDD ...");
>   JavaPairRDD hbaseTableRDD =
> this.sparkContext.newAPIHadoopRDD(
>   configuration, TableInputFormat.class,
> ImmutableBytesWritable.class, Result.class);
>   JavaRDD tableRDD = getTableRDD(hbaseTableRDD, dataFrameModel);
>   System.out.println("Count :: " + tableRDD.count());
> Following is the error which I can see in the container logs
>
> 16/01/13 10:01:42 WARN security.UserGroupInformation:
> PriviledgedActionException as:sparkUser (auth:SIMPLE)
> cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> 16/01/13 10:01:42 WARN ipc.RpcClient: Exception encountered while
> connecting to the server : javax.security.sasl.SaslException: GSS initiate
> failed [Caused by GSSException: No valid credentials provided (Mechanism
> level: Failed to find any Kerberos tgt)]
> 16/01/13 10:01:42 ERROR ipc.RpcClient: SASL authentication failed. The
> most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
>  at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>  at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:770)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$600(RpcClient.java:357)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection$2.run(RpcClient.java:891)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection$2.run(RpcClient.java:888)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>
> Have valid Kerberos Token as can be seen below:
>
> sparkUser@infra:/ebs1/agent$ klist
> Ticket cache: FILE:/tmp/krb5cc_1001
> Default principal: sparkUser@CERT.LOCAL
>
> Valid startingExpires   Service principal
> 13/01/2016 12:07  14/01/2016 12:07  krbtgt/CERT.LOCAL@CERT.LOCAL
>
> Also, I confirmed that only reading from HBase is giving this problem.
> Because I can read a simple file in HDFS and I am able to create the RDD as
> required.
> After digging through some contents in the net, found that there is a
> ticket in JIRA which is logged which is similar to what I am experiencing
> *https://issues.apache.org/jira/browse/SPARK-12279
> 

spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
Hi ,
   I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.

It doesn't work properly - I got exceptions, and at the end a standalone Spark
cluster was installed.
here is log information:

Any suggestions?

Thanks
Oleg.

oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
--key-pair=CC-ES-Demo
 
--identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
--region=us-east-1 --zone=us-east-1a --spot-price=0.05   -s 5
--spark-version=1.6.0   launch entity-extraction-spark-cluster
Setting up security groups...
Searching for existing cluster entity-extraction-spark-cluster in region
us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Requesting 5 slaves as spot instances with price $0.050
Waiting for spot instances to be granted...
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
All 5 slaves granted
Launched master in us-east-1a, regid = r-9384033f
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state..

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.
Cluster is now in 'ssh-ready' state. Waited 442 seconds.
Generating cluster's SSH key on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Transferring cluster's SSH key to slaves...
ec2-54-165-243-74.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
(ECDSA) to the list of known hosts.
ec2-54-88-245-107.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
(ECDSA) to the list of known hosts.
ec2-54-172-29-47.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
(ECDSA) to the list of known hosts.
ec2-54-165-131-210.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
(ECDSA) to the list of known hosts.
ec2-54-172-46-184.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
(ECDSA) to the list of known hosts.
Cloning spark-ec2 scripts from
https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Cloning into 'spark-ec2'...
remote: Counting objects: 2068, done.
remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
Receiving objects: 100% (2068/2068), 349.76 KiB, done.
Resolving deltas: 100% (796/796), done.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Deploying files to master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
sending incremental file list
root/spark-ec2/ec2-variables.sh

sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
total size is 1,684  speedup is 0.90
Running setup on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Setting up Spark on ip-172-31-24-124.ec2.internal...
Setting executable permissions on scripts...
RSYNC'ing /root/spark-ec2 to other cluster nodes...
ec2-54-165-243-74.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-243-74.compute-1.amazonaws.com,172.31.19.61'
(ECDSA) to the list of known hosts.
ec2-54-88-245-107.compute-1.amazonaws.com
id_rsa

 100% 1679 1.6KB/s   00:00
Warning: Permanently added
'ec2-54-88-245-107.compute-1.amazonaws.com,172.31.30.81'
(ECDSA) to the list of known hosts.
ec2-54-172-29-47.compute-1.amazonaws.com
id_rsa

 100% 1679 1.6KB/s   00:00

Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread zml张明磊
Hello,

   I want to find some test files in Spark that provide the same kind of
functionality as the Hadoop MiniCluster test environment, but I cannot find
them. Does anyone know about that?


Re: Spark Streaming on mesos

2016-01-18 Thread Iulian Dragoș
On Mon, Nov 30, 2015 at 4:09 PM, Renjie Liu  wrote:

> Hi, Lulian:
>

Please, it's Iulian, not Lulian.


> Are you sure that it'll be a long running process in fine-grained mode? I
> think you have a misunderstanding about it. An executor will be launched
> for some tasks, but not a long running process. When a group of tasks
> finished, it will get shutdown.
>

Sorry I missed your answer. Yes, I'm pretty sure, and if you have SSH
access to one of the slaves it's pretty easy to check. What makes you think
otherwise?


>
> On Mon, Nov 30, 2015 at 6:25 PM Iulian Dragoș 
> wrote:
>
>> Hi,
>>
>> Latency isn't such a big issue as it sounds. Did you try it out and
>> failed some performance metrics?
>>
>> In short, the *Mesos* executor on a given slave is going to be
>> long-running (consuming memory, but no CPUs). Each Spark task will be
>> scheduled using Mesos CPU resources, but they don't suffer much latency.
>>
>> iulian
>>
>>
>> On Mon, Nov 30, 2015 at 4:17 AM, Renjie Liu 
>> wrote:
>>
>>> Hi, Tim:
>>> Fine grain mode is not suitable for streaming applications since it need
>>> to start up an executor each time. When will the revamp get release? In the
>>> coming 1.6.0?
>>>
>>> On Sun, Nov 29, 2015 at 6:16 PM Timothy Chen  wrote:
>>>
 Hi Renjie,

 You can set number of cores per executor with spark executor cores in
 fine grain mode.

 If you want coarse grain mode to support that it will
 Be supported in the near term as he coarse grain scheduler is getting
 revamped now.

 Tim

 On Nov 28, 2015, at 7:31 PM, Renjie Liu 
 wrote:

 Hi, Nagaraj:
  Thanks for the response, but this does not solve my problem.
 I think executor memory should be proportional to number of cores, or
 number of core
 in each executor should be the same.
 On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar <
 nchandrashe...@innominds.com> wrote:

> Hi Renjie,
>
> I have not setup Spark Streaming on Mesos but there is something
> called reservations in Mesos.  It supports both Static and Dynamic
> reservations.  Both types of reservations must have role defined. You may
> want to explore these options.   Excerpts from the Apache Mesos
> documentation.
>
> Cheers
> Nagaraj C
> Reservation
>
> Mesos provides mechanisms to reserve resources in specific slaves.
> The concept was first introduced with static reservation in 0.14.0
> which enabled operators to specify the reserved resources on slave 
> startup.
> This was extended with dynamic reservation in 0.23.0 which enabled
> operators and authorized frameworks to dynamically reserve resources
> in the cluster.
>
> No breaking changes were introduced with dynamic reservation, which
> means the existing static reservation mechanism continues to be fully
> supported.
>
> In both types of reservations, resources are reserved for a role.
> Static Reservation (since 0.14.0)
>
> An operator can configure a slave with resources reserved for a role.
> The reserved resources are specified via the --resources flag. For
> example, suppose we have 12 CPUs and 6144 MB of RAM available on a slave
> and that we want to reserve 8 CPUs and 4096 MB of RAM for the ads role.
> We start the slave like so:
>
> $ mesos-slave \
>   --master=: \
>   --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096"
>
> We now have 8 CPUs and 4096 MB of RAM reserved for ads on this slave.
>
>
> From: Renjie Liu 
> Date: Friday, November 27, 2015 at 9:57 PM
> To: "user@spark.apache.org" 
> Subject: Spark Streaming on mesos
>
> Hi, all:
> I'm trying to run spark streaming on mesos and it seems that none of
> the scheduler is suitable for that. Fine grain scheduler will start an
> executor for each task so it will significantly increase the latency. 
> While
> coarse grained mode can only set the max core numbers and executor memory
> but there's no way to set the number of cores for each executor. Has 
> anyone
> deployed spark streaming on mesos? And what's your settings?
> --
> Liu, Renjie
> Software Engineer, MVAD
>
 --
 Liu, Renjie
 Software Engineer, MVAD

 --
>>> Liu, Renjie
>>> Software Engineer, MVAD
>>>
>>
>>
>>
>> --
>>
>> --
>> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com
>>
>> --
> Liu, Renjie
> Software Engineer, MVAD
>



-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Extracting p values in Logistic regression using mllib scala

2016-01-18 Thread Chandan Verma
Hi,

 

Can anyone help me extract p-values from a logistic regression model
using MLlib and Scala?

 

Thanks

Chandan Verma




===
DISCLAIMER:
The information contained in this message (including any attachments) is 
confidential and may be privileged. If you have received it by mistake please 
notify the sender by return e-mail and permanently delete this message and any 
attachments from your system. Any dissemination, use, review, distribution, 
printing or copying of this message in whole or in part is strictly prohibited. 
Please note that e-mails are susceptible to change. CitiusTech shall not be 
liable for the improper or incomplete transmission of the information contained 
in this communication nor for any delay in its receipt or damage to your 
system. CitiusTech does not guarantee that the integrity of this communication 
has been maintained or that this communication is free of viruses, 
interceptions or interferences. 



[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers,

What do you do if you want the from_unixtime function in Spark SQL to
return the time in a timezone you specify instead of the system timezone?

Best Regards,

Jerry
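
(Not an answer from this thread, just a commonly suggested workaround,
sketched under assumptions: instead of relying on from_unixtime and the JVM
default timezone, cast the epoch seconds to a timestamp and shift it
explicitly with from_utc_timestamp. The column name and zone ID below are
examples only.)

import org.apache.spark.sql.functions._

val withLocalTime = df.withColumn(
  "event_time_ny",
  from_utc_timestamp(col("epoch_seconds").cast("timestamp"), "America/New_York"))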


Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent... on
local mode I never paid attention, but the code path should be similar...
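
(For concreteness, a minimal sketch of firing two independent actions on the
same cached RDD from separate threads via Futures; the input path is a
placeholder.)

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd = sc.textFile("hdfs:///data/events").cache()  // cache so both actions reuse it

val total    = Future { rdd.count() }
val distinct = Future { rdd.distinct().count() }

// both jobs are submitted concurrently; with the FAIR scheduler they can share resources
val Seq(n, d) = Await.result(Future.sequence(Seq(total, distinct)), Duration.Inf)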
On Jan 18, 2016 8:00 AM, "Koert Kuipers"  wrote:

> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom 
> wrote:
>
>> Hi,
>>
>> I am running my app in a single machine first before moving it in the
>> cluster; actually simultaneous actions are not working for me now; is this
>> coming from the fact that I am using a single machine? Yet I am using
>> FAIR scheduler.
>>
>> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>>
>>> It can be far more than that (e.g.
>>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>>> either unrecognized or a greatly under-appreciated and underused feature of
>>> Spark.
>>>
>>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>>> wrote:
>>>
 the re-use of shuffle files is always a nice surprise to me

 On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
 wrote:

> Same SparkContext means same pool of Workers.  It's up to the
> Scheduler, not the SparkContext, whether the exact same Workers or
> Executors will be used to calculate simultaneous actions against the same
> RDD.  It is likely that many of the same Workers and Executors will be 
> used
> as the Scheduler tries to preserve data locality, but that is not
> guaranteed.  In fact, what is most likely to happen is that the shared
> Stages and Tasks being calculated for the simultaneous actions will not
> actually be run at exactly the same time, which means that shuffle files
> produced for one action will be reused by the other(s), and repeated
> calculations will be avoided even without explicitly caching/persisting 
> the
> RDD.
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
> wrote:
>
>> Same rdd means same sparkcontext means same workers
>>
>> Cache/persist the rdd to avoid repeated jobs
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" 
>> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers,
>>>
>>> If I correctly understand, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME rdd, (which is logical 
>>> because
>>> they are read only object). however, I want to know if the same workers 
>>> are
>>> used for the concurrent analysis ?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>>
 I stand corrected. How considerable are the benefits though? Will
 the scheduler be able to dispatch jobs from both actions 
 simultaneously (or
 on a when-workers-become-available basis)?

 On 15 January 2016 at 11:44, Koert Kuipers 
 wrote:

> we run multiple actions on the same (cached) rdd all the time, i
> guess in different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use
>> them this way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
>> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions
>> in parallel? The idea behind RDDs is to provide you with an 
>> abstraction for
>> computing parallel operations on distributed data. Even if you were 
>> to call
>> actions from several threads at once, the individual executors of 
>> your
>> spark environment would still have to perform operations 
>> sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single 
>> operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney > > wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira 
>>> escribió:
>>>
 Hi,

 Can we run *simultaneous* actions on the *same RDD* ?; if yes
 how can this
 be done ?

 Thank you,
 Regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com 

Spark SQL create table

2016-01-18 Thread raghukiran
Is creating a table using the SparkSQLContext currently supported?

Regards,
Raghu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Ted Yu
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)

I don't think you should call methods on the sparkContext inside the map
function; the sparkContext lives on the driver side.
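
(A sketch of the restructuring that advice implies, assuming for simplicity
that the two services return plain Scala collections of pairs: block on the
Futures first, then touch the SparkContext only from the driver thread, and
stop it only after the last action. The timeout and the element types are
assumptions, not taken from the original code.)

import scala.concurrent.Await
import scala.concurrent.duration._

// assumed: externalCallOne, externalCallTwo: Future[Seq[(String, Int)]]
val dataOne = Await.result(externalCallOne, 30.seconds)
val dataTwo = Await.result(externalCallTwo, 30.seconds)

val rddOne = sparkContext.parallelize(dataOne)  // driver-side, outside any Future
val rddTwo = sparkContext.parallelize(dataTwo)
val joined = rddOne.leftOuterJoin(rddTwo)

joined.foreach(println)  // run the action...
sparkContext.stop()      // ...and only stop the context once the work is done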

Cheers

On Mon, Jan 18, 2016 at 6:27 AM, Marco  wrote:

> Hello,
>
> I am using Spark 1.5.1 within SBT, and Scala 2.10.6 and I am facing an
> issue with the SparkContext.
>
> Basically, I have an object that needs to do several things:
>
> - call an external service One (web api)
> - call an external service Two (another api)
> - read and produce an RDD from HDFS (Spark)
> - parallelize the data obtained in the first two calls
> - join these different rdds, do stuff with them...
>
> Now, I am trying to do it in an asynchronous way. This doesn't seem to
> work, though. My guess is that Spark doesn't see the calls to .parallelize,
> as they are made in different tasks (or Future, therefore this code is
> called before/later or maybe with an unset Context (can it be?)). I have
> tried different ways, one of these being the call to SparkEnv.set in the
> calls to flatMap and map (in the Future). However, all I get is Cannot call
> methods on a stopped SparkContext. It just doesnt'work - maybe I just
> misunderstood what it does, therefore I removed it.
>
> This is the code I have written so far:
>
> object Fetcher {
>
>   def fetch(name, master, ...) = {
> val externalCallOne: Future[WSResponse] = externalService1()
> val externalCallTwo: Future[String] = externalService2()
> // val sparkEnv = SparkEnv.get
> val config = new SparkConf()
> .setAppName(name)
> .set("spark.master", master)
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>
> val sparkContext = new SparkContext(config)
> //val sparkEnv = SparkEnv.get
>
> val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
>   // SparkEnv.set(sparkEnv)
>   externalCallTwo map { dataTwo =>
> println("in map") // prints, so it gets here ...
> val rddOne = sparkContext.parallelize(dataOne)
> val rddTwo = sparkContext.parallelize(dataTwo)
> // do stuff here ... foreach/println, and
>
> val joinedData = rddOne leftOuterJoin (rddTwo)
>   }
> }
> eventuallyJoinedData onSuccess { case success => ...  }
> eventuallyJoinedData onFailure { case error =>
> println(error.getMessage) }
> // sparkContext.stop
>   }
>
> }
> As you can see, I have also tried to comment the line to stop the context,
> but then I get another issue:
>
> 13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext
> - Starting job: count at Fetcher.scala:38
> 13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
> Selector.select() returned prematurely because
> Thread.currentThread().interrupt() was called. Use
> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
> 13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner
> - Error in cleaning thread
> java.lang.InterruptedException: null
> at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
> ~[na:1.8.0_65]
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> 13:09:14.940 [db-async-netty-thread-1] DEBUG
> io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
> because Thread.currentThread().interrupt() was called. Use
> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
> 13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
> uncaught error in thread SparkListenerBus, stopping SparkContext
> java.lang.InterruptedException: null
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
> ~[na:1.8.0_65]
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> ~[na:1.8.0_65]
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
> ~[na:1.8.0_65]
> at
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> 

Re: Serializing DataSets

2016-01-18 Thread Michael Armbrust
What error?

On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner  wrote:

> And for deserializing,
> `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
> error?
>
> 2016-01-14 3:43 GMT+08:00 Michael Armbrust :
> > Yeah, thats the best way for now (note the conversion is purely logical
> so
> > there is no cost of calling toDF()).  We'll likely be combining the
> classes
> > in Spark 2.0 to remove this awkwardness.
> >
> > On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner 
> > wrote:
> >>
> >> What's the proper way to write DataSets to disk? Convert them to a
> >> DataFrame and use the writers there?
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets 
wrote:

> I thought script tries to install hadoop / hdfs also. And it looks like it
> failed. Installation is only standalone spark without hadoop. Is it correct
> behaviour?
>

Yes, it also sets up two HDFS clusters. Are they not working? Try to see if
Spark is working by running some simple jobs on it. (See
http://spark.apache.org/docs/latest/ec2-scripts.html.)

There is no program called Hadoop. If you mean YARN, then indeed the script
does not set up YARN. It sets up standalone Spark.
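
For example, a minimal smoke test from spark-shell on the master could look like
the sketch below. It uses only parallelized data, so it runs even if you suspect
HDFS is broken (the numbers are arbitrary):

val data = sc.parallelize(1 to 1000)
println(data.map(_ * 2).sum())   // runs a real job on the cluster without touching HDFS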


> Also errors in the log:
>ERROR: Unknown Tachyon version
>Error: Could not find or load main class crayondata.com.log
>

As long as Spark is working fine, you can ignore all output from the EC2
script :).


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Have you taken a look
at sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ?

You can find examples there.
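
For reference, the pattern exercised there looks roughly like the sketch below
(Spark 1.x syntax); the JDBC URL, table name and credentials are placeholders to
replace with your own:

sqlContext.sql(
  """CREATE TEMPORARY TABLE people
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url 'jdbc:h2:mem:testdb0', dbtable 'TEST.PEOPLE',
    |         user 'testUser', password 'testPass')
  """.stripMargin)

sqlContext.sql("SELECT * FROM people").show()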

On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:

> Is creating a table using the SparkSQLContext currently supported?
>
> Regards,
> Raghu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Please take a look
at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala

On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:

> Is creating a table using the SparkSQLContext currently supported?
>
> Regards,
> Raghu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
This requires Hive to be installed and uses HiveContext, right?

What is the SparkSQLContext useful for?

On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:

> Please take a look
> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>
> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>
>> Is creating a table using the SparkSQLContext currently supported?
>>
>> Regards,
>> Raghu
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Btw, Thanks a lot for all your quick responses - it is very useful and
definitely appreciate it :-)

On Mon, Jan 18, 2016 at 1:28 PM, Raghu Ganti  wrote:

> This requires Hive to be installed and uses HiveContext, right?
>
> What is the SparkSQLContext useful for?
>
> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>
>> Please take a look
>> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>
>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>>
>>> Is creating a table using the SparkSQLContext currently supported?
>>>
>>> Regards,
>>> Raghu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
By SparkSQLContext, I assume you mean SQLContext.
From the doc for SQLContext#createDataFrame():

   *  dataFrame.registerTempTable("people")
   *  sqlContext.sql("select name from people").collect.foreach(println)

If you want to persist the table externally, you need Hive, etc.
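
A minimal end-to-end sketch of that route (the case class, data and table name
are just illustrative; assumes a plain SQLContext, e.g. in spark-shell):

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(Seq(Person("Alice", 30), Person("Bob", 25)))
df.registerTempTable("people")   // visible only within this SQLContext, not persisted

sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)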

Regards

On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti  wrote:

> This requires Hive to be installed and uses HiveContext, right?
>
> What is the SparkSQLContext useful for?
>
> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>
>> Please take a look
>> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>
>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>>
>>> Is creating a table using the SparkSQLContext currently supported?
>>>
>>> Regards,
>>> Raghu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey Marco,

Since the code in the Future runs asynchronously, you cannot call
"sparkContext.stop" at the end of "fetch", because the code in the Future may
not have finished yet.

However, the exception seems weird. Do you have a simple reproducer?
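
For illustration only, a minimal sketch (not your original code) of one way to
sequence it: combine the two futures, keep every SparkContext call on the driver,
and block on the combined result before stopping the context. The service calls
and types below are simplified stand-ins:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object FetcherSketch {
  def fetch(name: String, master: String): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName(name).setMaster(master))

    // stand-ins for the two external service calls
    val callOne: Future[Seq[(String, Int)]] = Future(Seq(("a", 1), ("b", 2)))
    val callTwo: Future[Seq[(String, Int)]] = Future(Seq(("a", 10)))

    // the callback below still runs in a driver-side thread, so the parallelize
    // calls happen on the driver; the key point is to wait for the future to
    // complete before calling sc.stop()
    val joinedCount: Future[Long] = callOne.zip(callTwo).map { case (dataOne, dataTwo) =>
      val rddOne = sc.parallelize(dataOne)
      val rddTwo = sc.parallelize(dataTwo)
      rddOne.leftOuterJoin(rddTwo).count()   // an action, so the job actually runs
    }

    println("joined rows: " + Await.result(joinedCount, 10.minutes))
    sc.stop()   // only after the asynchronous work has finished
  }
}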


On Mon, Jan 18, 2016 at 9:13 AM, Ted Yu  wrote:

>   externalCallTwo map { dataTwo =>
> println("in map") // prints, so it gets here ...
> val rddOne = sparkContext.parallelize(dataOne)
>
> I don't think you should call method on sparkContext in map function.
> sparkContext lives on driver side.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 6:27 AM, Marco  wrote:
>
>> Hello,
>>
>> I am using Spark 1.5.1 within SBT, and Scala 2.10.6 and I am facing an
>> issue with the SparkContext.
>>
>> Basically, I have an object that needs to do several things:
>>
>> - call an external service One (web api)
>> - call an external service Two (another api)
>> - read and produce an RDD from HDFS (Spark)
>> - parallelize the data obtained in the first two calls
>> - join these different rdds, do stuff with them...
>>
>> Now, I am trying to do it in an asynchronous way. This doesn't seem to
>> work, though. My guess is that Spark doesn't see the calls to .parallelize,
>> as they are made in different tasks (or Future, therefore this code is
>> called before/later or maybe with an unset Context (can it be?)). I have
>> tried different ways, one of these being the call to SparkEnv.set in the
>> calls to flatMap and map (in the Future). However, all I get is Cannot call
>> methods on a stopped SparkContext. It just doesnt'work - maybe I just
>> misunderstood what it does, therefore I removed it.
>>
>> This is the code I have written so far:
>>
>> object Fetcher {
>>
>>   def fetch(name, master, ...) = {
>> val externalCallOne: Future[WSResponse] = externalService1()
>> val externalCallTwo: Future[String] = externalService2()
>> // val sparkEnv = SparkEnv.get
>> val config = new SparkConf()
>> .setAppName(name)
>> .set("spark.master", master)
>> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>
>> val sparkContext = new SparkContext(config)
>> //val sparkEnv = SparkEnv.get
>>
>> val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
>>   // SparkEnv.set(sparkEnv)
>>   externalCallTwo map { dataTwo =>
>> println("in map") // prints, so it gets here ...
>> val rddOne = sparkContext.parallelize(dataOne)
>> val rddTwo = sparkContext.parallelize(dataTwo)
>> // do stuff here ... foreach/println, and
>>
>> val joinedData = rddOne leftOuterJoin (rddTwo)
>>   }
>> }
>> eventuallyJoinedData onSuccess { case success => ...  }
>> eventuallyJoinedData onFailure { case error =>
>> println(error.getMessage) }
>> // sparkContext.stop
>>   }
>>
>> }
>> As you can see, I have also tried to comment the line to stop the
>> context, but then I get another issue:
>>
>> 13:09:14.929 [ForkJoinPool-1-worker-5] INFO
>>  org.apache.spark.SparkContext - Starting job: count at Fetcher.scala:38
>> 13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
>> Selector.select() returned prematurely because
>> Thread.currentThread().interrupt() was called. Use
>> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
>> 13:09:14.936 [Spark Context Cleaner] ERROR
>> org.apache.spark.ContextCleaner - Error in cleaning thread
>> java.lang.InterruptedException: null
>> at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
>> ~[na:1.8.0_65]
>> at
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
>> ~[spark-core_2.10-1.5.1.jar:1.5.1]
>> at
>> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> at 
>> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> at
>> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> 13:09:14.940 [db-async-netty-thread-1] DEBUG
>> io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
>> because Thread.currentThread().interrupt() was called. Use
>> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
>> 13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
>> uncaught error in thread SparkListenerBus, stopping SparkContext
>> java.lang.InterruptedException: null
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
>> ~[na:1.8.0_65]
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>> ~[na:1.8.0_65]
>> at 

Re: Spark Streaming: Does mapWithState implicitly partition the dsteram?

2016-01-18 Thread Shixiong(Ryan) Zhu
mapWithState uses HashPartitioner by default. You can use
"StateSpec.partitioner" to set your custom partitioner.

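A minimal sketch of where the partitioner is plugged in (Spark 1.6); the state
function, key/value types and partition count below are only illustrative, and
keyedStream is assumed to be an existing DStream[(String, Int)]:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val spec = StateSpec
  .function(updateCount _)
  .partitioner(new HashPartitioner(16))   // or any custom Partitioner

val stateStream = keyedStream.mapWithState(spec)
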
On Sun, Jan 17, 2016 at 11:00 AM, Lin Zhao  wrote:

> When the state is passed to the task that handles a mapWithState for a
> particular key, if the key is distributed, it seems extremely difficult to
> coordinate and synchronise the state. Is there a partition by key before a
> mapWithState? If not what exactly is the execution model?
>
> Thanks,
>
> Lin
>
>


Spark Summit East - Full Schedule Available

2016-01-18 Thread Scott walent
Join the Apache Spark community at the 2nd annual Spark Summit East from
February 16-18, 2016 in New York City.

We will kick things off with a Spark update from Matei Zaharia followed by
over 60 talks that were selected by the program committee. The agenda this
year includes enterprise talks from Microsoft, Bloomberg and Comcast as
well as the popular developer, data science, research and application
tracks.  See the full agenda at https://spark-summit.org/east-2016/schedule.


If you are new to Spark or looking to improve on your knowledge of the
technology, we are offering three levels of Spark Training: Spark
Essentials, Advanced Exploring Wikipedia with Spark, and Data Science with
Spark. Visit https://spark-summit.org/east-2016/schedule/spark-training for
details.

Space is limited and we anticipate selling out, so register now! Use promo
code "ApacheListEast" to save 20% when registering before January 29, 2016.
Register at https://spark-summit.org/register.

We look forward to seeing you there.

Scott and the Summit Organizers


How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi,

I have a scenario wherein my dataset has around 30 columns. It is basically
user activity information. I need to group the information by each user and
then for each column/activity parameter I need to find the percentage
affinity for each value in that column for that user. Below is the sample
input and output.

UserId  C1          C2          C3
1       A           <20         0
1       A           >20 & <40   1
1       B           >20 & <40   0
1       C           >20 & <40   0
1       C           >20 & <40   0
2       A           <20         0
3       B           >20 & <40   1
3       B           >40         2

Output

UserId  C1                 C2                      C3
1       A:0.4|B:0.2|C:0.4  <20:0.2|>20 & <40:0.8   0:0.8|1:0.2
2       A:1                <20:1                   0:1
3       B:1                >20 & <40:0.5|>40:0.5   1:0.5|2:0.5

Presently this is how I am calculating these values:
Group by UserId and C1 and compute values for C1 for all the users, then do
a group by UserId and C2 and find the fractions for C2 for each user, and
so on. This approach is quite slow. Also, the number of records for each
user will be at max 30. So I would like to take a second approach wherein I
do a groupByKey and pass the entire list of records for each key to a
function which computes all the percentages for each column for each user
at once. Below are the steps I am trying to follow:

1. Dataframe1 => group by UserId , find the counts of records for each
user. Join the results back to the input so that counts are available with
each record
2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))

def myUserAggregator(rows: Iterable[Row]):
scala.collection.mutable.Map[Int,String] = {
val returnValue = scala.collection.mutable.Map[Int,String]()
if (rows != null) {
  val activityMap = scala.collection.mutable.Map[Int,
scala.collection.mutable.Map[String,
Int]]().withDefaultValue(scala.collection.mutable.Map[String,
Int]().withDefaultValue(0))

  val rowIt = rows.iterator
  var sentCount = 1
  for (row <- rowIt) {
sentCount = row(29).toString().toInt
for (i <- 0 until row.length) {
  var m = activityMap(i)
  if (activityMap(i) == null) {
m = collection.mutable.Map[String,
Int]().withDefaultValue(0)
  }
  m(row(i).toString()) += 1
  activityMap.update(i, m)
}
  }
  var activityPPRow: Row = Row()
  for((k,v) <- activityMap) {
  var rowVal:String = ""
  for((a,b) <- v) {
rowVal += rowVal + a + ":" + b/sentCount + "|"
  }
  returnValue.update(k, rowVal)
//  activityPPRow.apply(k) = rowVal
  }

}
return returnValue
  }

When I run step 2 I get the following error. I am new to Scala and Spark
and am unable to figure out how to pass the Iterable[Row] to a function and
get back the results.

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
..


Thanks for the help.

Regards,
Neha Mehta


Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi,

I am running my app on a single machine first before moving it to the
cluster; actually, simultaneous actions are not working for me now. Is this
coming from the fact that I am using a single machine? Yet I am using the
FAIR scheduler.

2016-01-17 21:23 GMT+01:00 Mark Hamstra :

> It can be far more than that (e.g.
> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
> either unrecognized or a greatly under-appreciated and underused feature of
> Spark.
>
> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers  wrote:
>
>> the re-use of shuffle files is always a nice surprise to me
>>
>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
>> wrote:
>>
>>> Same SparkContext means same pool of Workers.  It's up to the Scheduler,
>>> not the SparkContext, whether the exact same Workers or Executors will be
>>> used to calculate simultaneous actions against the same RDD.  It is likely
>>> that many of the same Workers and Executors will be used as the Scheduler
>>> tries to preserve data locality, but that is not guaranteed.  In fact, what
>>> is most likely to happen is that the shared Stages and Tasks being
>>> calculated for the simultaneous actions will not actually be run at exactly
>>> the same time, which means that shuffle files produced for one action will
>>> be reused by the other(s), and repeated calculations will be avoided even
>>> without explicitly caching/persisting the RDD.
>>>
>>> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
>>> wrote:
>>>
 Same rdd means same sparkcontext means same workers

 Cache/persist the rdd to avoid repeated jobs
 On Jan 17, 2016 5:21 AM, "Mennour Rostom"  wrote:

> Hi,
>
> Thank you all for your answers,
>
> If I correctly understand, actions (in my case foreach) can be run
> concurrently and simultaneously on the SAME rdd, (which is logical because
> they are read only object). however, I want to know if the same workers 
> are
> used for the concurrent analysis ?
>
> Thank you
>
> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>
>> I stand corrected. How considerable are the benefits though? Will the
>> scheduler be able to dispatch jobs from both actions simultaneously (or 
>> on
>> a when-workers-become-available basis)?
>>
>> On 15 January 2016 at 11:44, Koert Kuipers  wrote:
>>
>>> we run multiple actions on the same (cached) rdd all the time, i
>>> guess in different threads indeed (its in akka)
>>>
>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 RDDs actually are thread-safe, and quite a few applications use
 them this way, e.g. the JDBC server.

 Matei

 On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
 wrote:

 I don't think RDDs are threadsafe.
 More fundamentally however, why would you want to run RDD actions
 in parallel? The idea behind RDDs is to provide you with an 
 abstraction for
 computing parallel operations on distributed data. Even if you were to 
 call
 actions from several threads at once, the individual executors of your
 spark environment would still have to perform operations sequentially.

 As an alternative, I would suggest to restructure your RDD
 transformations to compute the required results in one single 
 operation.

 On 15 January 2016 at 06:18, Jonathan Coveney 
 wrote:

> Threads
>
>
> On Friday, January 15, 2016, Kira 
> wrote:
>
>> Hi,
>>
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how
>> can this
>> be done ?
>>
>> Thank you,
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com .
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


>>>
>>
>
>>>
>>
>


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I thought the script tries to install Hadoop/HDFS as well, and it looks like it
failed. The installation is only standalone Spark without Hadoop. Is this the
correct behaviour?
Also errors in the log:
   ERROR: Unknown Tachyon version
   Error: Could not find or load main class crayondata.com.log

Thanks
Oleg.


Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
stacktrace? details?

On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom  wrote:

> Hi,
>
> I am running my app in a single machine first before moving it in the
> cluster; actually simultaneous actions are not working for me now; is this
> comming from the fact that I am using a single machine ? yet I am using
> FAIR scheduler.
>
> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>
>> It can be far more than that (e.g.
>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>> either unrecognized or a greatly under-appreciated and underused feature of
>> Spark.
>>
>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>> wrote:
>>
>>> the re-use of shuffle files is always a nice surprise to me
>>>
>>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
>>> wrote:
>>>
 Same SparkContext means same pool of Workers.  It's up to the
 Scheduler, not the SparkContext, whether the exact same Workers or
 Executors will be used to calculate simultaneous actions against the same
 RDD.  It is likely that many of the same Workers and Executors will be used
 as the Scheduler tries to preserve data locality, but that is not
 guaranteed.  In fact, what is most likely to happen is that the shared
 Stages and Tasks being calculated for the simultaneous actions will not
 actually be run at exactly the same time, which means that shuffle files
 produced for one action will be reused by the other(s), and repeated
 calculations will be avoided even without explicitly caching/persisting the
 RDD.

 On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
 wrote:

> Same rdd means same sparkcontext means same workers
>
> Cache/persist the rdd to avoid repeated jobs
> On Jan 17, 2016 5:21 AM, "Mennour Rostom"  wrote:
>
>> Hi,
>>
>> Thank you all for your answers,
>>
>> If I correctly understand, actions (in my case foreach) can be run
>> concurrently and simultaneously on the SAME rdd, (which is logical 
>> because
>> they are read only object). however, I want to know if the same workers 
>> are
>> used for the concurrent analysis ?
>>
>> Thank you
>>
>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>
>>> I stand corrected. How considerable are the benefits though? Will
>>> the scheduler be able to dispatch jobs from both actions simultaneously 
>>> (or
>>> on a when-workers-become-available basis)?
>>>
>>> On 15 January 2016 at 11:44, Koert Kuipers 
>>> wrote:
>>>
 we run multiple actions on the same (cached) rdd all the time, i
 guess in different threads indeed (its in akka)

 On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
 matei.zaha...@gmail.com> wrote:

> RDDs actually are thread-safe, and quite a few applications use
> them this way, e.g. the JDBC server.
>
> Matei
>
> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
> wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions
> in parallel? The idea behind RDDs is to provide you with an 
> abstraction for
> computing parallel operations on distributed data. Even if you were 
> to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD
> transformations to compute the required results in one single 
> operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney 
> wrote:
>
>> Threads
>>
>>
>> On Friday, January 15, 2016, Kira 
>> wrote:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes
>>> how can this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com .
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>

>>>
>>

>>>
>>
>


spark random forest regressor : argument minInstancesPerNode not accepted

2016-01-18 Thread Christopher Bourez
Dears,

I'm trying to set the parameter 'minInstancesPerNode', but it does not seem to
be accepted:

model = RandomForest.trainRegressor(trainingData,
categoricalFeaturesInfo={},numTrees=2500,
featureSubsetStrategy="sqrt",impurity='variance',minInstancesPerNode=1000)

Traceback (most recent call last):

  File "", line 1, in 

TypeError: trainRegressor() got an unexpected keyword argument
'minInstancesPerNode'

Thanks for your help

Christopher Bourez
06 17 17 50 60


Re: Serializing DataSets

2016-01-18 Thread Simon Hafner
And for deserializing,
`sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
error?

2016-01-14 3:43 GMT+08:00 Michael Armbrust :
> Yeah, thats the best way for now (note the conversion is purely logical so
> there is no cost of calling toDF()).  We'll likely be combining the classes
> in Spark 2.0 to remove this awkwardness.
>
> On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner 
> wrote:
>>
>> What's the proper way to write DataSets to disk? Convert them to a
>> DataFrame and use the writers there?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey, did you mean that the scheduling delay timeline is incorrect because
it's too short and some values are missing? A batch won't have a scheduling
delay until it starts to run. In your example, a lot of batches are still
waiting, so they don't have a scheduling delay yet.

On Sun, Jan 17, 2016 at 4:49 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm trying to understand how Scheduling Delays are displayed in
> Streaming page in web UI and think the values are displayed
> incorrectly in the Timelines column. I'm only concerned with the
> scheduling delays (on y axis) per batch times (x axis). It appears
> that the values (on y axis) are correct, but not how they are
> displayed per batch times.
>
> See the second screenshot in
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming-webui.html#scheduling-delay
> .
>
> Can anyone explain how the delays for batches per batch time should be
> read? I'm specifically asking about the timeline (not histogram as it
> seems fine).
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
Look at to_utc_timestamp and from_utc_timestamp.
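
For example, a sketch with the DataFrame functions (Spark 1.5+), assuming the
session timezone is UTC; the column name, value and target zone are just examples:

import org.apache.spark.sql.functions.{from_unixtime, from_utc_timestamp}

// a one-column example frame; "epoch" holds Unix seconds
val df = sqlContext.createDataFrame(Seq(Tuple1(1453155600L))).toDF("epoch")

val withTs = df.withColumn(
  "ts_new_york",
  from_utc_timestamp(from_unixtime(df("epoch")), "America/New_York"))
withTs.show()
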
On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:

> Hi spark users and developers,
>
> what do you do if you want the from_unixtime function in spark sql to
> return the timezone you want instead of the system timezone?
>
> Best Regards,
>
> Jerry
>


PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread efwalkermit
Should I be able to broadcast a fairly simple user-defined class?  I'm having
no success in 1.6.0 (or 1.5.2):

$ cat test_spark.py
import pyspark


class SimpleClass:
def __init__(self):
self.val = 5
def get(self):
return self.val


def main():
sc = pyspark.SparkContext()
b = sc.broadcast(SimpleClass())
results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
b.value.get()).collect() 

if __name__ == '__main__':
main()


$ spark-submit --master local[1] test_spark.py
[snip]
  File "/Users/ed/src/mrspark/examples/fortyler/test_spark.py", line 14, in

results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
b.value.get()).collect()
  File
"/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
line 97, in value
self._value = self.load(self._path)
  File
"/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
line 88, in load
return pickle.load(f)
AttributeError: 'module' object has no attribute 'SimpleClass'
[snip]



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Broadcast-of-User-Defined-Class-No-Work-tp26000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread Maciej Szymkiewicz
Python can pickle only objects, not classes. It means that SimpleClass
has to be importable on every worker node to enable correct
deserialization. Typically that means keeping class definitions in a
separate module and distributing them using, for example, --py-files.


On 01/19/2016 12:34 AM, efwalkermit wrote:
> Should I be able to broadcast a fairly simple user-defined class?  I'm having
> no success in 1.6.0 (or 1.5.2):
>
> $ cat test_spark.py
> import pyspark
>
>
> class SimpleClass:
> def __init__(self):
> self.val = 5
> def get(self):
> return self.val
>
>
> def main():
> sc = pyspark.SparkContext()
> b = sc.broadcast(SimpleClass())
> results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
> b.value.get()).collect() 
>
> if __name__ == '__main__':
> main()
>
>
> $ spark-submit --master local[1] test_spark.py
> [snip]
>   File "/Users/ed/src/mrspark/examples/fortyler/test_spark.py", line 14, in
> 
> results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
> b.value.get()).collect()
>   File
> "/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
> line 97, in value
> self._value = self.load(self._path)
>   File
> "/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
> line 88, in load
> return pickle.load(f)
> AttributeError: 'module' object has no attribute 'SimpleClass'
> [snip]
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Broadcast-of-User-Defined-Class-No-Work-tp26000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>






Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
?

On 18 January 2016 at 13:04, Andy Davidson 
wrote:

> Hi
>
> My project is implemented using Java 8 and Python. Some times its handy to
> look at the spark source code. For unknown reason if I open a spark project
> my java projects show tons of compiler errors. I think it may have
> something to do with Scala. If I close the projects my java code is fine.
>
> I typically I only want to import the machine learning and streaming
> projects.
>
> I am not sure if this is an issue or not but my java projects are built
> using gradel
>
> In eclipse preferences -> scala -> installations I selected Scala: 2.10.6
> (built in)
>
> Any suggestions would be greatly appreciate
>
> Andy
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark + Sentry + Kerberos don't add up?

2016-01-18 Thread Ruslan Dautkhanov
Hi Romain,

Thank you for your response.

Adding Kerberos support might be as simple as
https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and
--keytab parameters to be passed to spark-submit.

As a workaround I just did kinit (using Hue's keytab) and then launched the
Livy Server. It will probably work as long as the Kerberos ticket doesn't
expire. That's why it would be great to have support for the --principal and
--keytab parameters for spark-submit, as explained in
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html


The only problem I have currently is the above error stack in my previous
email:

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
> UserGroupInformation.java:1671)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
> SparkSubmit.scala:160)



>> AFAIK Hive impersonation should be turned off when using Sentry

Yep, exactly. That's what I did. It is disabled now. But it looks like, on the
other hand, Spark or the Spark Notebook wants to have that enabled?
It tries to do org.apache.hadoop.security.UserGroupInformation.doAs(), hence
the error.

So Sentry isn't compatible with Spark on kerberized clusters? Is there any
workaround for this problem?


-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux  wrote:

> Livy does not support any Kerberos yet
> https://issues.cloudera.org/browse/LIVY-3
>
> Are you focusing instead about HS2 + Kerberos with Sentry?
>
> AFAIK Hive impersonation should be turned off when using Sentry:
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>
> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov 
> wrote:
>
>> Getting following error stack
>>
>> The Spark session could not be created in the cluster:
>>> at org.apache.hadoop.security.*UserGroupInformation.doAs*
>>> (UserGroupInformation.java:1671)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
>>> .open(HiveMetaStoreClient.java:466)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>> ... 35 more
>>
>>
>> My understanding that hive.server2.enable.impersonation and
>> hive.server2.enable.doAs should be enabled to make
>> UserGroupInformation.doAs() work?
>>
>> When I try to enable these parameters, Cloudera Manager shows error
>>
>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
>>> (hostname)'.
>>> Hive Impersonation should be disabled to enable Hive authorization using
>>> Sentry
>>
>>
>> So Spark-Hive conflicts with Sentry!?
>>
>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>> CDH 5.5.
>>
>> This is a kerberized cluster with Sentry.
>>
>> I was using hue's keytab as hue user is normally (by default in CDH) is
>> allowed to impersonate to other users.
>> So very convenient for Spark Notebooks.
>>
>> Any information to help solve this will be highly appreciated.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Hue-Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to hue-user+unsubscr...@cloudera.org.
>>
>
>


Re: trouble using eclipse to view spark source code

2016-01-18 Thread Annabel Melongo
Andy,
This has nothing to do with Spark, but I guess you don't have the proper Scala
version. The version you're currently running doesn't recognize a method in
Scala ArrayOps, namely: scala.collection.mutable.ArrayOps.$colon$plus

On Monday, January 18, 2016 7:53 PM, Andy Davidson 
 wrote:
 

 Many thanks. I was using a different scala plug in. this one seems to work 
better I no longer get compile error how ever I get the following stack trace 
when I try to run my unit tests with mllib open
I am still using eclipse luna.
Andy
java.lang.NoSuchMethodError: 
scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/reflect/ClassTag;)Ljava/lang/Object;
 at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73) at 
org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76) at 
org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64) at 
com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52) at 
com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
 at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) 
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

From:  Jakob Odersky 
Date:  Monday, January 18, 2016 at 3:20 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble using eclipse to view spark source code


Have you followed the guide on how to import spark into eclipse 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
 ?

On 18 January 2016 at 13:04, Andy Davidson  
wrote:

Hi 
My project is implemented using Java 8 and Python. Some times its handy to look 
at the spark source code. For unknown reason if I open a spark project my java 
projects show tons of compiler errors. I think it may have something to do with 
Scala. If I close the projects my java code is fine.
I typically I only want to import the machine learning and streaming projects.
I am not sure if this is an issue or not but my java projects are built using 
gradel
In eclipse preferences -> scala -> installations I selected Scala: 2.10.6 
(built in)
Any suggestions would be greatly appreciate
Andy






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

  
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Many thanks. I was using a different Scala plug-in. This one seems to work
better; I no longer get compile errors. However, I get the following stack
trace when I try to run my unit tests with the mllib project open.

I am still using eclipse luna.

Andy

java.lang.NoSuchMethodError:
scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/refle
ct/ClassTag;)Ljava/lang/Object;
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73)
at org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76)
at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64)
at com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52)
at com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.
java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.j
ava:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.ja
va:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.jav
a:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26
)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestRef
erence.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:3
8)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:459)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:675)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.
java:382)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner
.java:192)


From:  Jakob Odersky 
Date:  Monday, January 18, 2016 at 3:20 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble using eclipse to view spark source code

> Have you followed the guide on how to import spark into eclipse
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#Usefu
> lDeveloperTools-Eclipse ?
> 
> On 18 January 2016 at 13:04, Andy Davidson 
> wrote:
>> Hi 
>> 
>> My project is implemented using Java 8 and Python. Some times its handy to
>> look at the spark source code. For unknown reason if I open a spark project
>> my java projects show tons of compiler errors. I think it may have something
>> to do with Scala. If I close the projects my java code is fine.
>> 
>> I typically I only want to import the machine learning and streaming
>> projects.
>> 
>> I am not sure if this is an issue or not but my java projects are built using
>> gradel
>> 
>> In eclipse preferences -> scala -> installations I selected Scala: 2.10.6
>> (built in)
>> 
>> Any suggestions would be greatly appreciate
>> 
>> Andy
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
It looks like Spark is not working fine:

I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html )
and I see spot instances installed on EC2.

From the spark shell I am counting lines and got a connection exception.
*scala> val lines = sc.textFile("README.md")*
*scala> lines.count()*



*scala> val lines = sc.textFile("README.md")*

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from
textFile at :21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile
at :21

*scala> lines.count()*

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
4 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
5 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
6 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
7 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
8 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
java.lang.RuntimeException: java.net.ConnectException: Call to
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000 failed on
connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:567)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1143)
at 

rdd.foreach return value

2016-01-18 Thread charles li
code snippet

the 'print' actually prints info on the worker node, but I feel confused about
where the 'return' value goes, for I get nothing on the driver node.
-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread David Russell
The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org


code snippet




the 'print' actually print info on the worker node, but I feel confused where 
the 'return' value

goes to. for I get nothing on the driver node.
--


--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao

foreach.png
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
Here is the signature for foreach:
 def foreach(f: T => Unit): Unit = withScope {

I don't think you can return an element in the way shown in the snippet.
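
A small sketch of the difference; to get values back on the driver, use a
transformation plus an action (collect here, which assumes the result is small)
rather than returning from foreach:

val rdd = sc.parallelize(1 to 10)

// prints on the executors; nothing comes back to the driver
rdd.foreach(x => println(x))

// brings results back to the driver
val doubled: Array[Int] = rdd.map(_ * 2).collect()
doubled.foreach(println)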

On Mon, Jan 18, 2016 at 7:34 PM, charles li  wrote:

> code snippet
>
>
> ​
> the 'print' actually print info on the worker node, but I feel confused
> where the 'return' value
> goes to. for I get nothing on the driver node.
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi,

I have a scenario wherein my dataset has around 30 columns. It is basically
user activity information. I need to group the information by each user and
then for each column/activity parameter I need to find the percentage
affinity for each value in that column for that user. Below is the sample
input and output.

UserId C1 C2 C3
1 A <20 0
1 A >20 & <40 1
1 B >20 & <40 0
1 C >20 & <40 0
1 C >20 & <40 0
2 A <20 0
3 B >20 & <40 1
3 B >40 2








Output


1 A:0.4|B:0.2|C:0.4 <20:02|>20 & <40:0.8 0:0.8|1:0.2
2 A:1 <20:1 0:01
3 B:1 >20 & <40:0.5|>40:0.5 1:0.5|2:0.5

Presently this is how I am calculating these values:
Group by UserId and C1 and compute values for C1 for all the users, then do
a group by by Userid and C2 and find the fractions for C2 for each user and
so on. This approach is quite slow.  Also the number of records for each
user will be at max 30. So I would like to take a second approach wherein I
do a groupByKey and pass the entire list of records for each key to a
function which computes all the percentages for each column for each user
at once. Below are the steps I am trying to follow:

1. Dataframe1 => group by UserId , find the counts of records for each
user. Join the results back to the input so that counts are available with
each record
2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))

  def myUserAggregator(rows: Iterable[Row]): scala.collection.mutable.Map[Int, String] = {
    val returnValue = scala.collection.mutable.Map[Int, String]()
    if (rows != null) {
      val activityMap = scala.collection.mutable.Map[Int, scala.collection.mutable.Map[String, Int]]()
        .withDefaultValue(scala.collection.mutable.Map[String, Int]().withDefaultValue(0))

      val rowIt = rows.iterator
      var sentCount = 1
      for (row <- rowIt) {
        sentCount = row(29).toString().toInt
        for (i <- 0 until row.length) {
          var m = activityMap(i)
          if (activityMap(i) == null) {
            m = collection.mutable.Map[String, Int]().withDefaultValue(0)
          }
          m(row(i).toString()) += 1
          activityMap.update(i, m)
        }
      }
      var activityPPRow: Row = Row()
      for ((k, v) <- activityMap) {
        var rowVal: String = ""
        for ((a, b) <- v) {
          rowVal += rowVal + a + ":" + b / sentCount + "|"
        }
        returnValue.update(k, rowVal)
        // activityPPRow.apply(k) = rowVal
      }
    }
    return returnValue
  }

When I run step 2 I get the following error. I am new to Scala and Spark
and am unable to figure out how to pass the Iterable[Row] to a function and
get back the results.

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
..


Thanks for the help.

Regards,
Neha Mehta
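
For reference, the "Task not serializable" error usually means the function passed to
map/groupByKey captures its enclosing object (for example a class or REPL object that
also holds the SparkContext), which is not serializable. One way to avoid that is to put
the aggregation logic in a standalone serializable object. A minimal sketch only, not the
original code; column positions, names, and the exact percentage format are illustrative:

import org.apache.spark.sql.Row

object UserAffinity extends Serializable {
  // For every column index, the fraction of this user's rows holding each distinct value,
  // rendered as "value:fraction|value:fraction|..."
  def aggregate(rows: Iterable[Row]): Map[Int, String] = {
    val recs = rows.toSeq
    if (recs.isEmpty) Map.empty[Int, String]
    else {
      val total = recs.size.toDouble
      (0 until recs.head.length).map { i =>
        val fractions = recs
          .groupBy(r => r(i).toString)
          .map { case (value, rs) => value + ":" + (rs.size / total) }
        i -> fractions.mkString("|")
      }.toMap
    }
  }
}

// usage (assuming the user id is in column 0):
// val result = dataframe1.rdd
//   .map(r => (r(0).toString, r))
//   .groupByKey()
//   .mapValues(UserAffinity.aggregate)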


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Peter Zhang
Could you run spark-shell from the $SPARK_HOME directory?

You can try running your command from $SPARK_HOME, or point to README.md
with a full path.
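
For example, something like the following; the local path is only an illustration. The
point is that a bare relative path such as "README.md" is resolved against the cluster's
default filesystem (HDFS on port 9000 for the EC2 scripts), which is why the shell keeps
trying to connect there:

// read a local file explicitly (it must exist at this path on every node),
// or upload the file to HDFS first and use an hdfs:// URI instead
val lines = sc.textFile("file:///root/spark/README.md")
lines.count()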


Peter Zhang
-- 
Google
Sent with Airmail

On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com) wrote:

It looks spark is not working fine : 
 
I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html. ) 
and I see spot instances installed on EC2.

from spark shell I am counting lines and got connection exception.
scala> val lines = sc.textFile("README.md")
scala> lines.count()



scala> val lines = sc.textFile("README.md")

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from textFile at 
:21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
:21

scala> lines.count()

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 0 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 1 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 2 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 3 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 4 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 5 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 6 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 7 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 8 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 9 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
java.lang.RuntimeException: java.net.ConnectException: Call to 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000 failed on 
connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:567)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 

Spark Streaming - Latest batch-time can't keep up with current time

2016-01-18 Thread Collin Shi
Hi all, 

After submitting the job, the latest batch-time is at first almost the same as the current
time. Let's say, if the current time is '12:00:00', then the latest batch-time would be
'11:59:59'. But as time goes on, the difference gets greater and greater. For instance,
the current time is '13:00:00' but the latest batch-time is '12:40:00'.

The total delay is sometimes greater than the batch interval, but that shouldn't have an
influence on the batch time, right?

How could I fix this problem?

I'm using Spark 1.4.0 and Kafka.



Thanks,
Collin
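
One way to see where the time goes is to attach a StreamingListener and compare each
batch's processing time against the batch interval; if processing regularly takes longer
than the interval, batches queue up and the latest batch time falls further and further
behind the clock. A minimal sketch (it assumes your StreamingContext is called ssc):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// print scheduling delay and processing time for every completed batch
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println("batch " + info.batchTime +
      ": scheduling delay = " + info.schedulingDelay.getOrElse(-1L) + " ms" +
      ", processing time = " + info.processingDelay.getOrElse(-1L) + " ms")
  }
})

If processing time is consistently above the batch interval, the usual options are a
larger batch interval, more cores/partitions, or less work per batch.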


SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Hi all,

http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
From Hive tables
You can also create SparkR DataFrames from Hive tables. To do this we will need 
to create a HiveContext which can access tables in the Hive MetaStore. Note 
that Spark should have been built with Hive support and more details on the 
difference between SQLContext and HiveContext can be found in the SQL 
programming guide.

# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)

sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

# Queries can be expressed in HiveQL.
results <- sql(hiveContext, "FROM src SELECT key, value")

# results is now a DataFrame
head(results)
##  key   value
## 1 238 val_238
## 2  86  val_86
## 3 311 val_311

I use RStudio to run the above commands. When I run "sql(hiveContext, "CREATE TABLE
IF NOT EXISTS src (key INT, value STRING)")", I get this exception:

16/01/19 12:11:51 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/src
16/01/19 12:11:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one)

How do I use HDFS instead of the local file system (file:)?
Which parameter should be set?

Thanks a lot.


Peter Zhang
-- 
Google
Sent with Airmail

Re: SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Thanks, 

I will try.

Peter

-- 
Google
Sent with Airmail

On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote:

Please make sure you export environment variable HADOOP_CONF_DIR which contains 
the core-site.xml

On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang  wrote:
Hi all,

http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
From Hive tables
You can also create SparkR DataFrames from Hive tables. To do this we will need 
to create a HiveContext which can access tables in the Hive MetaStore. Note 
that Spark should have been built with Hive support and more details on the 
difference between SQLContext and HiveContext can be found in the SQL 
programming guide.


# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)

sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

# Queries can be expressed in HiveQL.
results <- sql(hiveContext, "FROM src SELECT key, value")

# results is now a DataFrame
head(results)
##  key   value
## 1 238 val_238
## 2  86  val_86
## 3 311 val_311

I use RStudio to run above command, when I run "sql(hiveContext,  
"CREATE TABLE IF NOT EXISTS src (key INT, value
STRING)”)”

I got
exception: 16/01/19 12:11:51 INFO FileUtils: Creating directory if it doesn't 
exist: file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:file:/user/hive/warehouse/src is not a directory or 
unable to create one)

How  to use HDFS instead of local file system(file)?
Which parameter should to set?

Thanks a lot.


Peter Zhang
-- 
Google
Sent with Airmail



--
Best Regards

Jeff Zhang

Re: rdd.foreach return value

2016-01-18 Thread charles li
hi, great thanks to david and ted, I know that the content of RDD can be
returned to driver using 'collect' method.

but my question is:


1. cause we can write any code we like in the function put into 'foreach',
so what happened when we actually write a 'return' sentence in the foreach
function?
2. as the photo shows bellow, the content of RDD doesn't change after
foreach function, why?
3. I feel a little confused about the 'foreach' method, it should be an
'action', right? cause it return nothing. or is there any best practice of
the 'foreach' funtion? or can some one put your code snippet when using
'foreach' method in your application, that would be awesome.


great thanks again



​

On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:

> Here is signature for foreach:
>  def foreach(f: T => Unit): Unit = withScope {
>
> I don't think you can return element in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
> wrote:
>
>> code snippet
>>
>>
>> ​
>> the 'print' actually print info on the worker node, but I feel confused
>> where the 'return' value
>> goes to. for I get nothing on the driver node.
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
For #2, RDD is immutable. 

> On Jan 18, 2016, at 8:10 PM, charles li  wrote:
> 
> 
> hi, great thanks to david and ted, I know that the content of RDD can be 
> returned to driver using 'collect' method.
> 
> but my question is:
> 
> 
> 1. cause we can write any code we like in the function put into 'foreach', so 
> what happened when we actually write a 'return' sentence in the foreach 
> function?
> 2. as the photo shows bellow, the content of RDD doesn't change after foreach 
> function, why?
> 3. I feel a little confused about the 'foreach' method, it should be an 
> 'action', right? cause it return nothing. or is there any best practice of 
> the 'foreach' funtion? or can some one put your code snippet when using 
> 'foreach' method in your application, that would be awesome. 
> 
> 
> great thanks again
> 
> 
> 
> ​
> 
>> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>> Here is signature for foreach:
>>  def foreach(f: T => Unit): Unit = withScope {
>> 
>> I don't think you can return element in the way shown in the snippet.
>> 
>>> On Mon, Jan 18, 2016 at 7:34 PM, charles li  wrote:
>>> code snippet
>>> 
>>> <屏幕快照 2016-01-19 上午11.32.05.png>
>>> ​
>>> the 'print' actually print info on the worker node, but I feel confused 
>>> where the 'return' value 
>>> goes to. for I get nothing on the driver node.
>>> -- 
>>> --
>>> a spark lover, a quant, a developer and a good man.
>>> 
>>> http://github.com/litaotao
> 
> 
> 
> -- 
> --
> a spark lover, a quant, a developer and a good man.
> 
> http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread Vishal Maru
1. foreach doesn't expect any value from the function being passed (your
func_foreach), so nothing happens. The return values are just lost; it's
like calling a function without saving its return value to another var.
foreach also doesn't return anything, so you don't get a modified RDD (unlike
map*).
2. RDDs are immutable. All transform functions (map*, groupBy*, reduceBy*,
etc.) return a new RDD.
3. Yes, it is an action. It just iterates through the elements and calls the function
being passed. That's it. It doesn't collect the values and doesn't return any new
modified RDD.
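
To get data back to the driver you would normally use an action that returns values
(collect, reduce, count, ...) or an accumulator, rather than a return inside foreach.
A minimal sketch (variable names are illustrative):

val rdd = sc.parallelize(1 to 5)

// foreach runs the function for its side effects only; its return value is discarded,
// and the println output appears in the executor logs, not on the driver
rdd.foreach(x => println(x))

// bring values back to the driver with an action that returns data ...
val doubled = rdd.map(_ * 2).collect()   // Array(2, 4, 6, 8, 10) on the driver

// ... or aggregate from inside foreach with an accumulator
val sum = sc.accumulator(0)
rdd.foreach(x => sum += x)
println(sum.value)                       // 15 on the driver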


On Mon, Jan 18, 2016 at 11:10 PM, charles li 
wrote:

>
> hi, great thanks to david and ted, I know that the content of RDD can be
> returned to driver using 'collect' method.
>
> but my question is:
>
>
> 1. cause we can write any code we like in the function put into 'foreach',
> so what happened when we actually write a 'return' sentence in the foreach
> function?
> 2. as the photo shows bellow, the content of RDD doesn't change after
> foreach function, why?
> 3. I feel a little confused about the 'foreach' method, it should be an
> 'action', right? cause it return nothing. or is there any best practice of
> the 'foreach' funtion? or can some one put your code snippet when using
> 'foreach' method in your application, that would be awesome.
>
>
> great thanks again
>
>
>
> ​
>
> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>
>> Here is signature for foreach:
>>  def foreach(f: T => Unit): Unit = withScope {
>>
>> I don't think you can return element in the way shown in the snippet.
>>
>> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
>> wrote:
>>
>>> code snippet
>>>
>>>
>>> ​
>>> the 'print' actually print info on the worker node, but I feel confused
>>> where the 'return' value
>>> goes to. for I get nothing on the driver node.
>>> --
>>> *--*
>>> a spark lover, a quant, a developer and a good man.
>>>
>>> http://github.com/litaotao
>>>
>>
>>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: rdd.foreach return value

2016-01-18 Thread charles li
got it, great thanks, Vishal, Ted and David

On Tue, Jan 19, 2016 at 1:10 PM, Vishal Maru  wrote:

> 1. foreach doesn't expect any value from function being passed (in your
> func_foreach). so nothing happens. The return values are just lost. it's
> like calling a function without saving return value to another var.
> foreach also doesn't return anything so you don't get modified RDD (like
> map*).
> 2. RDD's are immutable. All transform functions (map*,groupBy*,reduceBy
> etc.) return new RDD.
> 3. Yes. It's just iterates through elements and calls the function being
> passed. That's it. It doesn't collect the values and don't return any new
> modified RDD.
>
>
> On Mon, Jan 18, 2016 at 11:10 PM, charles li 
> wrote:
>
>>
>> hi, great thanks to david and ted, I know that the content of RDD can be
>> returned to driver using 'collect' method.
>>
>> but my question is:
>>
>>
>> 1. cause we can write any code we like in the function put into
>> 'foreach', so what happened when we actually write a 'return' sentence in
>> the foreach function?
>> 2. as the photo shows bellow, the content of RDD doesn't change after
>> foreach function, why?
>> 3. I feel a little confused about the 'foreach' method, it should be an
>> 'action', right? cause it return nothing. or is there any best practice of
>> the 'foreach' funtion? or can some one put your code snippet when using
>> 'foreach' method in your application, that would be awesome.
>>
>>
>> great thanks again
>>
>>
>>
>> ​
>>
>> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>>
>>> Here is signature for foreach:
>>>  def foreach(f: T => Unit): Unit = withScope {
>>>
>>> I don't think you can return element in the way shown in the snippet.
>>>
>>> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
>>> wrote:
>>>
 code snippet


 ​
 the 'print' actually print info on the worker node, but I feel confused
 where the 'return' value
 goes to. for I get nothing on the driver node.
 --
 *--*
 a spark lover, a quant, a developer and a good man.

 http://github.com/litaotao

>>>
>>>
>>
>>
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I am running from $SPARK_HOME.
It looks like a connection problem to port 9000. It is on the master machine.
What is this process that Spark tries to connect to?
Should I start any framework or processes before executing Spark?

Thanks
OIeg.


16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
4 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
5 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
6 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
7 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
8 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
9 time(s); retry

On Tue, Jan 19, 2016 at 1:13 PM, Peter Zhang  wrote:

> Could you run spark-shell at $SPARK_HOME DIR?
>
> You can try to change you command run at $SPARK_HOME or, point to
> README.md with full path.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>
> On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com)
> wrote:
>
> It looks spark is not working fine :
>
> I followed this link (
> http://spark.apache.org/docs/latest/ec2-scripts.html. ) and I see spot
> instances installed on EC2.
>
> from spark shell I am counting lines and got connection exception.
> *scala> val lines = sc.textFile("README.md")*
> *scala> lines.count()*
>
>
>
> *scala> val lines = sc.textFile("README.md")*
>
> 16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as
> values in memory (estimated size 26.5 KB, free 26.5 KB)
> 16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0
> stored as bytes in memory (estimated size 5.6 KB, free 32.1 KB)
> 16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
> 16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from
> textFile at :21
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile
> at :21
>
> *scala> lines.count()*
>
> 16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 3 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 4 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
> 

Re: rdd.foreach return value

2016-01-18 Thread charles li
thanks, david and ted, I know that the content of RDD can be returned to
driver using `collect


​

On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:

> Here is signature for foreach:
>  def foreach(f: T => Unit): Unit = withScope {
>
> I don't think you can return element in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
> wrote:
>
>> code snippet
>>
>>
>> ​
>> the 'print' actually print info on the worker node, but I feel confused
>> where the 'return' value
>> goes to. for I get nothing on the driver node.
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: SparkR with Hive integration

2016-01-18 Thread Jeff Zhang
Please make sure you export environment variable HADOOP_CONF_DIR which
contains the core-site.xml

On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang  wrote:

> Hi all,
>
> http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
> From Hive tables
> 
>
> You can also create SparkR DataFrames from Hive tables. To do this we will
> need to create a HiveContext which can access tables in the Hive MetaStore.
> Note that Spark should have been built with Hive support
> 
>  and
> more details on the difference between SQLContext and HiveContext can be
> found in the SQL programming guide
> 
> .
>
> # sc is an existing SparkContext.
> hiveContext <- sparkRHive.init(sc)
>
> sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sql(hiveContext, "LOAD DATA LOCAL INPATH 
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> # Queries can be expressed in HiveQL.
> results <- sql(hiveContext, "FROM src SELECT key, value")
> # results is now a DataFrame
> head(results)
> ##  key   value
> ## 1 238 val_238
> ## 2  86  val_86
> ## 3 311 val_311
>
>
> I use RStudio to run above command, when I run "sql(hiveContext, "CREATE
> TABLE IF NOT EXISTS src (key INT, value STRING)”)”
>
> I got exception: 16/01/19 12:11:51 INFO FileUtils: Creating directory if
> it doesn't exist: file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR
> DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/src is not a directory or
> unable to create one)
>
> How  to use HDFS instead of local file system(file)?
> Which parameter should to set?
>
> Thanks a lot.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>



-- 
Best Regards

Jeff Zhang


Re: Spark 1.6.0, yarn-shuffle

2016-01-18 Thread johd
Hi,

No, i have not. :-/

Regards, J



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-0-yarn-shuffle-tp25961p26002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: rdd.foreach return value

2016-01-18 Thread Darren Govoni


What's the rationale behind that? It certainly limits the kind of flow logic we 
can do in one statement.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: David Russell  
Date: 01/18/2016  10:44 PM  (GMT-05:00) 
To: charles li  
Cc: user@spark.apache.org 
Subject: Re: rdd.foreach return value 

The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.

David

"All that is gold does not glitter, Not all those who wander are lost."


 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org

code snippet



the 'print' actually print info on the worker node, but I feel confused where 
the 'return' value 
goes to. for I get nothing on the driver node.
-- 
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao



Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I am adding an answer from SO:
http://stackoverflow.com/questions/34861947/read-more-kafka-topics-than-number-of-cpu-cores

  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997p26001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread vivek.meghanathan
Have you verified that the Spark master/slaves are started correctly? Please check
the open ports using the netstat command. Are they listening? Which address are they
bound to, etc.?

From: Oleg Ruchovets [mailto:oruchov...@gmail.com]
Sent: 19 January 2016 11:24
To: Peter Zhang 
Cc: Daniel Darabos ; user 

Subject: Re: spark 1.6.0 on ec2 doesn't work

I am running from  $SPARK_HOME.
It looks like connection  problem to port 9000. It is on master machine.
What is this process is spark tries to connect?
Should I start any framework , processes before executing spark?

Thanks
OIeg.


16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 7 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 9 time(s); retry

On Tue, Jan 19, 2016 at 1:13 PM, Peter Zhang 
> wrote:
Could you run spark-shell at $SPARK_HOME DIR?

You can try to change you command run at $SPARK_HOME or, point to README.md 
with full path.


Peter Zhang
--
Google
Sent with Airmail


On January 19, 2016 at 11:26:14, Oleg Ruchovets 
(oruchov...@gmail.com) wrote:
It looks spark is not working fine :

I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html. ) 
and I see spot instances installed on EC2.

from spark shell I am counting lines and got connection exception.
scala> val lines = sc.textFile("README.md")
scala> lines.count()



scala> val lines = sc.textFile("README.md")

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 172.31.28.196:44028 (size: 5.6 KB, free: 
511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from textFile at 
:21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
:21

scala> lines.count()

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 0 time(s); retry policy is 

when enable kerberos in hdp, the spark does not work

2016-01-18 Thread 李振
When Kerberos is enabled in HDP, Spark does not work. The error is as follows:
Traceback (most recent call last):
  File "/home/lizhen/test.py", line 27, in 
abc = raw_data.count()
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006, 
in count
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 997, 
in sum
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 871, 
in fold
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773, 
in collect
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: java.net.ConnectException: Connection refused
at 
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:888)
at 
org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2243)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 

a problem about using UDF at sparksql

2016-01-18 Thread ??????
Hi everyone, can anyone take a look at this problem?


I am migrating from Hive to Spark SQL, but I have encountered a problem.


The temporary UDF function, which extends GenericUDF, can be created, and you can
describe it, but when I use the function, an exception occurs.
I don't know where the problem is.
spark version: apache 1.5.1 /standalone mode/CLI

spark-sql> describe function devicemodel;
Function: deviceModel
Class: td.enterprise.hive.udfs.ConvertDeviceModel
Usage: deviceModel(expr) - return other information
Time taken: 0.041 seconds, Fetched 3 row(s)

spark-sql> 
> select devicemodel(p.mobile_id, 'type', 'MOBILE_TYPE') as mobile_type from 
> analytics_device_profile_zh3 p limit 1;
16/01/19 15:19:46 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 15)
com.esotericsoftware.kryo.KryoException: Buffer underflow.
Serialization trace:
mobileAttributeCahceMap (td.enterprise.hive.udfs.ConvertDeviceModel)
at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
at com.esotericsoftware.kryo.io.Input.readAscii_slow(Input.java:580)
at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:558)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:436)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.deserializeObjectByKryo(HiveShim.scala:134)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.deserializePlan(HiveShim.scala:150)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.readExternal(HiveShim.scala:191)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 

using spark context in map funciton TASk not serilizable error

2016-01-18 Thread gpatcham
Hi,

I have a use case where I need to pass sparkcontext in map function 

reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)

Method1 needs spark context to query cassandra. But I see below error

java.io.NotSerializableException: org.apache.spark.SparkContext

Is there a way we can fix this ?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
Can you pass the properties which are needed for accessing Cassandra
without going through SparkContext ?

SparkContext isn't designed to be used in the way illustrated below.

Cheers

On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:

> Hi,
>
> I have a use case where I need to pass sparkcontext in map function
>
> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>
> Method1 needs spark context to query cassandra. But I see below error
>
> java.io.NotSerializableException: org.apache.spark.SparkContext
>
> Is there a way we can fix this ?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Jacek Laskowski
Hi Ryan,

Ah, you might be right! I didn't think about the batches queued up
(and hence without scheduling delay since they're not started yet).
Thanks a lot for responding!

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jan 18, 2016 at 8:03 PM, Shixiong(Ryan) Zhu
 wrote:
> Hey, did you mean that the scheduling delay timeline is incorrect because
> it's too short and some values are missing? A batch won't have a scheduling
> delay until it starts to run. In your example, a lot of batches are waiting
> so that they don't have the scheduling delay.
>
> On Sun, Jan 17, 2016 at 4:49 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> I'm trying to understand how Scheduling Delays are displayed in
>> Streaming page in web UI and think the values are displayed
>> incorrectly in the Timelines column. I'm only concerned with the
>> scheduling delays (on y axis) per batch times (x axis). It appears
>> that the values (on y axis) are correct, but not how they are
>> displayed per batch times.
>>
>> See the second screenshot in
>>
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming-webui.html#scheduling-delay.
>>
>> Can anyone explain how the delays for batches per batch time should be
>> read? I'm specifically asking about the timeline (not histogram as it
>> seems fine).
>>
>> Pozdrawiam,
>> Jacek
>>
>> Jacek Laskowski | https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark
>> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Great, I got that to work following your example! Thanks.

A followup question is: If I had a custom SQL type (UserDefinedType),
how can I map it to this type from the RDD in the DataFrame?

Regards

On Mon, Jan 18, 2016 at 1:35 PM, Ted Yu  wrote:

> By SparkSQLContext, I assume you mean SQLContext.
> From the doc for SQLContext#createDataFrame():
>
>*  dataFrame.registerTempTable("people")
>*  sqlContext.sql("select name from people").collect.foreach(println)
>
> If you want to persist table externally, you need Hive, etc
>
> Regards
>
> On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti 
> wrote:
>
>> This requires Hive to be installed and uses HiveContext, right?
>>
>> What is the SparkSQLContext useful for?
>>
>> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>>
>>> Please take a look
>>> at 
>>> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>>
>>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran 
>>> wrote:
>>>
 Is creating a table using the SparkSQLContext currently supported?

 Regards,
 Raghu



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


is recommendProductsForUsers available in ALS?

2016-01-18 Thread Roberto Pagliari
With Spark 1.5, the following code:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.recommendation import ALS, Rating
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1, seed=10)

res = model.recommendProductsForUsers(2)

raises the error

---
AttributeErrorTraceback (most recent call last)
 in ()
  7 model = ALS.trainImplicit(ratings, 1, seed=10)
  8
> 9 res = model.recommendProductsForUsers(2)

AttributeError: 'MatrixFactorizationModel' object has no attribute 
'recommendProductsForUsers'

If the method is not available, is there a workaround with a large number of 
users and products?
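
If staying on 1.5, one workaround is to train and query the model through the Scala API,
where recommendProductsForUsers is available and returns an RDD, so it scales to many
users and products. A minimal Scala sketch using the same toy ratings (the rank and
iteration values are illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 1.0),
  Rating(1, 2, 2.0),
  Rating(2, 1, 2.0)))

val model = ALS.trainImplicit(ratings, 1, 10)   // rank = 1, iterations = 10

// top 2 products per user, as an RDD[(userId, Array[Rating])]
val recs = model.recommendProductsForUsers(2)
recs.collect().foreach { case (user, rs) => println(user + " -> " + rs.mkString(", ")) }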


Re: spark streaming context trigger invoke stop why?

2016-01-18 Thread Shixiong(Ryan) Zhu
I see. There is a bug in 1.4.1 where a thread pool does not set the daemon
flag for its threads (
https://github.com/apache/spark/commit/346209097e88fe79015359e40b49c32cc0bdc439#diff-25124e4f06a1da237bf486eceb1f7967L47
)

So in 1.4.1, even if your main thread exits, the threads in that thread pool are
still running and the shutdown hook for StreamingContext cannot be called.

Actually, it's usually dangerous to ignore exceptions. If you really want
to, just use a `while(true)` loop to replace `awaitTermination`.
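
A rough sketch of that (with the caveat above that swallowing errors like this is usually
a bad idea):

ssc.start()

// instead of ssc.awaitTermination(): keep the driver's main thread alive ourselves,
// so an error reported to the ContextWaiter does not stop the application
while (true) {
  Thread.sleep(10000L)
}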


On Sat, Jan 16, 2016 at 12:02 AM, Triones,Deng(vip.com) <
triones.d...@vipshop.com> wrote:

> Thanks for your response.
>
>
>
> As I noticed, when my Spark version is 1.4.1, that kind of error
> won’t cause the driver to stop, whereas Spark 1.5.2 will cause the driver to stop, so I
> think there must be some change. As I see in the code @ Spark 1.5.2:
>
>
>
> JobScheduler.scala  :  jobScheduler.reportError("Error generating jobs for
> time " + time, e)  or  jobScheduler.reportError("Error in job generator",
> e)
>
> --> ContextWaiter.scala : notifyError()
>
>    --> ContextWaiter.scala : waitForStopOrError(), then the driver stops.
>
>
>
> According the driver log I have not seen message like “Error generating
> jobs for time” or “Error in job generator”
>
>
>
>
>
> Driver log as below :
>
>
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 410 in stage 215.0 failed 4 times, most recent
> failure: Lost task 410.3 in stage 215.0 (TID 178094, 10.201
>
> .114.142): java.lang.Exception: Could not compute split, block
> input-22-1452641669000 not found
>
> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:902)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:900)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>
> at
> 

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Thanks Alex:

So you suggested something like:
from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'),
'America/Los_Angeles')?

This is a lot of conversion :)

Is there a particular reason not to have from_unixtime take timezone
information?

I think I will make a UDF if this is the only way out of the box.
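
Something like this minimal sketch, perhaps (the name from_unixtime_tz and the output
pattern are just for illustration):

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// from_unixtime_tz(seconds, zone): format a unix timestamp in an arbitrary time zone
sqlContext.udf.register("from_unixtime_tz", (seconds: Long, tz: String) => {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(tz))
  fmt.format(new Date(seconds * 1000L))
})

// e.g. SELECT from_unixtime_tz(1389802875, 'America/Los_Angeles') FROM ...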

Thanks!

Jerry

On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov 
wrote:

> Look at
> to_utc_timestamp
>
> from_utc_timestamp
> On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:
>
>> Hi spark users and developers,
>>
>> what do you do if you want the from_unixtime function in spark sql to
>> return the timezone you want instead of the system timezone?
>>
>> Best Regards,
>>
>> Jerry
>>
>


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
If you can find a function in Oracle, MySQL or Postgres which works
better, then we can create a similar one.

Timezone conversion is tricky because of daylight saving time,
so it is better to use UTC without DST in the database/DW.
On Jan 18, 2016 1:24 PM, "Jerry Lam"  wrote:

> Thanks Alex:
>
> So you suggested something like:
> from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'),
> 'America/Los_Angeles')?
>
> This is a lot of conversion :)
>
> Is there a particular reason not to have from_unixtime to take timezone
> information?
>
> I think I will make a UDF if this is the only way out of the box.
>
> Thanks!
>
> Jerry
>
> On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov  > wrote:
>
>> Look at
>> to_utc_timestamp
>>
>> from_utc_timestamp
>> On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:
>>
>>> Hi spark users and developers,
>>>
>>> what do you do if you want the from_unixtime function in spark sql to
>>> return the timezone you want instead of the system timezone?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
I'm using spark cassandra connector to do this and the way we access
cassandra table is

sc.cassandraTable("keySpace", "tableName")

Thanks
Giri
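
One pattern that avoids putting the SparkContext into the closure is to capture the
connector's CassandraConnector (which is serializable) and open a session per partition.
A rough sketch only; keyspace, table, column names and the assumption that the key is
text are all illustrative, and reRDD/outputDir are the names from the original snippet:

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)   // serializable, unlike SparkContext

val enriched = reRDD.mapPartitions { rows =>
  connector.withSessionDo { session =>
    rows.map { row =>
      val rs = session.execute(
        "SELECT value FROM my_keyspace.my_table WHERE key = ?", row.toString)
      val value = Option(rs.one()).map(_.getString("value")).getOrElse("")
      row.toString + "," + value
    }.toList.iterator   // materialize before the session is closed
  }
}
enriched.saveAsTextFile(outputDir)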

On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:

> Can you pass the properties which are needed for accessing Cassandra
> without going through SparkContext ?
>
> SparkContext isn't designed to be used in the way illustrated below.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>
>> Hi,
>>
>> I have a use case where I need to pass sparkcontext in map function
>>
>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>
>> Method1 needs spark context to query cassandra. But I see below error
>>
>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>
>> Is there a way we can fix this ?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
Can we use @transient ?


On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:

> I'm using spark cassandra connector to do this and the way we access
> cassandra table is
>
> sc.cassandraTable("keySpace", "tableName")
>
> Thanks
> Giri
>
> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>
>> Can you pass the properties which are needed for accessing Cassandra
>> without going through SparkContext ?
>>
>> SparkContext isn't designed to be used in the way illustrated below.
>>
>> Cheers
>>
>> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>>
>>> Hi,
>>>
>>> I have a use case where I need to pass sparkcontext in map function
>>>
>>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>>
>>> Method1 needs spark context to query cassandra. But I see below error
>>>
>>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>>
>>> Is there a way we can fix this ?
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Ted Yu
Please see:
http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/

On Mon, Jan 18, 2016 at 1:22 PM, Mich Talebzadeh 
wrote:

> ./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6
> -Phive -Phive-thriftserver -Pyarn
>
>
>
>
>
> [INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.10 ---
>
> *../R/install-dev.sh: line 40: Rscript: command not found*
>
> [INFO]
> 
>
> [INFO] Reactor Summary:
>
> [INFO]
>
> [INFO] Spark Project Parent POM ... SUCCESS [  2.921 s]
>
> [INFO] Spark Project Test Tags  SUCCESS [  2.921 s]
>
> [INFO] Spark Project Launcher . SUCCESS [ 17.252 s]
>
> [INFO] Spark Project Networking ... SUCCESS [  9.237 s]
>
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  4.969 s]
>
> [INFO] Spark Project Unsafe ... SUCCESS [ 13.384 s]
>
> [INFO] Spark Project Core . FAILURE [01:34 min]
>
>
>
>
>
> How can I resolve this by any chance?
>
>
>
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>


Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I somehow don't want to believe in this waste of resources. Is it really true
that if I have 20 input streams I must have at least 21 CPU cores, even if I
read only once per minute and only a few messages? I still hope that I'm
missing some important information. Thanks a lot



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Hi 

My project is implemented using Java 8 and Python. Sometimes it's handy to
look at the Spark source code. For some unknown reason, if I open a Spark
project my Java projects show tons of compiler errors. I think it may have
something to do with Scala. If I close the Spark projects, my Java code is
fine.

I typically only want to import the machine learning and streaming
projects.

I am not sure whether this is relevant, but my Java projects are built
using Gradle.

In Eclipse preferences -> Scala -> Installations I selected Scala: 2.10.6
(built in).

Any suggestions would be greatly appreciated.

Andy








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
Did you mean constructing SparkContext on the worker nodes ?

Not sure whether that would work.

Doesn't seem to be good practice.

On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:

> Can we use @transient ?
>
>
> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>
>> I'm using spark cassandra connector to do this and the way we access
>> cassandra table is
>>
>> sc.cassandraTable("keySpace", "tableName")
>>
>> Thanks
>> Giri
>>
>> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>>
>>> Can you pass the properties which are needed for accessing Cassandra
>>> without going through SparkContext ?
>>>
>>> SparkContext isn't designed to be used in the way illustrated below.
>>>
>>> Cheers
>>>
>>> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>>>
 Hi,

 I have a use case where I need to pass sparkcontext in map function

 reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)

 Method1 needs spark context to query cassandra. But I see below error

 java.io.NotSerializableException: org.apache.spark.SparkContext

 Is there a way we can fix this ?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>
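If the per-row work is really a join on the table's primary key, another
option, assuming a connector version that provides joinWithCassandraTable and
a single-column partition key (the names below are made up), is to let the
connector do the join instead of querying inside map:

import com.datastax.spark.connector._

// each element of reRDD is assumed to be the partition-key value to join on
val joined = reRDD.map(key => Tuple1(key))
  .joinWithCassandraTable("keySpace", "tableName") // yields (key, CassandraRow) pairs

joined.map { case (Tuple1(key), row) => s"$key,${row.getString("value")}" }
  .saveAsTextFile(outputDir)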


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
Yes, I tried doing that, but it doesn't work.

I'm looking at using SQLContext and dataframes. Is SQLContext serializable?

On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu  wrote:

> Did you mean constructing SparkContext on the worker nodes ?
>
> Not sure whether that would work.
>
> Doesn't seem to be good practice.
>
> On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:
>
>> Can we use @transient ?
>>
>>
>> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>>
>>> I'm using spark cassandra connector to do this and the way we access
>>> cassandra table is
>>>
>>> sc.cassandraTable("keySpace", "tableName")
>>>
>>> Thanks
>>> Giri
>>>
>>> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>>>
 Can you pass the properties which are needed for accessing Cassandra
 without going through SparkContext ?

 SparkContext isn't designed to be used in the way illustrated below.

 Cheers

 On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:

> Hi,
>
> I have a use case where I need to pass sparkcontext in map function
>
> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>
> Method1 needs spark context to query cassandra. But I see below error
>
> java.io.NotSerializableException: org.apache.spark.SparkContext
>
> Is there a way we can fix this ?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread Tathagata Das
If you are using receiver-based input streams, then you have to dedicate 1
core to each receiver. If you read only once per minute on each receiver,
then consider consolidating the data reading pipeline so that you can use
fewer receivers.

On Mon, Jan 18, 2016 at 12:13 PM, radoburansky 
wrote:

> I somehow don't want to believe in this waste of resources. Is it really
> true that if I have 20 input streams I must have at least 21 CPU cores,
> even if I read only once per minute and only a few messages? I still hope
> that I'm missing some important information. Thanks a lot
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
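A minimal sketch of that consolidation (the host names, port and 60-second
batch interval are made up): a handful of receivers unioned into one DStream,
so only a few cores stay pinned to receiving and the rest are free for
processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("consolidated-receivers")
val ssc = new StreamingContext(conf, Seconds(60))

// e.g. 4 receivers instead of 20 -> only 4 cores dedicated to receiving
val receiverStreams = (1 to 4).map(i => ssc.socketTextStream("source-" + i, 9999))
val unified = ssc.union(receiverStreams)

unified.foreachRDD(rdd => println("received " + rdd.count() + " records"))

ssc.start()
ssc.awaitTermination()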


Re: has any one implemented TF_IDF using ML transformers?

2016-01-18 Thread Andy Davidson
Hi Yanbo

I am using 1.6.0. I am having a hard time trying to figure out what the
exact equation is; I do not know Scala.

I took a look at the source code URL you provided:

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    val idf = udf { vec: Vector => idfModel.transform(vec) }
    dataset.withColumn($(outputCol), idf(col($(inputCol))))
  }


You mentioned the doc is out of date:
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

Based on my understanding of the subject matter, the equations in that doc
are correct, but I could not find anything like them in the source code:

IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1)),

TFIDF(t,d,D) = TF(t,d) * IDF(t,D).


I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite,
but its results do not match the equation. (In general the unit test asserts
seem incomplete.)

I have created several small test examples to try to figure out how to use
NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta, probabilities,
… that Spark produces do not match the published results at
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html


Kind regards

Andy 

private DataFrame createTrainingData() {
    // make sure we only use dictionarySize words
    JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
            // label 0.0 is Chinese, label 1.0 is notChinese
            RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
            RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),
            RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
            RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan", "Chinese"))));

    return createData(rdd);
}

private DataFrame createData(JavaRDD<Row> rdd) {
    StructField id = new StructField("id", DataTypes.IntegerType, false, Metadata.empty());
    StructField label = new StructField("label", DataTypes.DoubleType, false, Metadata.empty());
    StructField words = new StructField("words",
            DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty());

    StructType schema = new StructType(new StructField[] { id, label, words });
    return sqlContext.createDataFrame(rdd, schema);
}

private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
    HashingTF hashingTF = new HashingTF()
            .setInputCol("words")
            .setOutputCol("tf")
            .setNumFeatures(dictionarySize);

    DataFrame termFrequenceDF = hashingTF.transform(rawDF);
    termFrequenceDF.cache(); // idf needs to make 2 passes over the data set

    // val idf = new IDF(minDocFreq = 2).fit(tf)
    IDFModel idf = new IDF()
            // .setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
            .setInputCol(hashingTF.getOutputCol())
            .setOutputCol("idf")
            .fit(termFrequenceDF);

    return idf.transform(termFrequenceDF);
}



 |-- id: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- words: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- tf: vector (nullable = true)
 |-- idf: vector (nullable = true)



+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
|id |label|words                       |tf                       |idf                                                    |
+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
|0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
|1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
|2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
|3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
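For what it's worth, the idf column above does line up with the formula in the
MLlib docs, IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1)) with the natural log. A
quick check in plain Scala (nothing Spark-specific):

// |D| = 4 training documents in the example above
val numDocs = 4
def idf(docFreq: Int): Double = math.log((numDocs + 1.0) / (docFreq + 1.0))

idf(4) // "Chinese" appears in all 4 docs -> 0.0
idf(1) // "Beijing", "Shanghai", "Macao", "Tokyo", "Japan" -> 0.9162907318741551

So at least for this corpus the ml.feature.IDF output matches the documented
equation.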




Here is the spark test case



  @Test
  public void tfIdf() {
    // The tests are to check Java compatibility.
    HashingTF tf = new HashingTF();
    @SuppressWarnings("unchecked")
    JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
      Arrays.asList("this is a sentence".split(" ")),
      Arrays.asList("this is another sentence".split(" ")),
  

building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Mich Talebzadeh
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6
-Phive -Phive-thriftserver -Pyarn

 

 

[INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.10 ---

../R/install-dev.sh: line 40: Rscript: command not found

[INFO]


[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [  2.921 s]

[INFO] Spark Project Test Tags  SUCCESS [  2.921 s]

[INFO] Spark Project Launcher . SUCCESS [ 17.252 s]

[INFO] Spark Project Networking ... SUCCESS [  9.237 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  4.969 s]

[INFO] Spark Project Unsafe ... SUCCESS [ 13.384 s]

[INFO] Spark Project Core . FAILURE [01:34 min]

 

 

How can I resolve this by any chance?

 

 

Thanks

 

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 



Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
class SQLContext private[sql](
    @transient val sparkContext: SparkContext,
    @transient protected[sql] val cacheManager: CacheManager,
    @transient private[sql] val listener: SQLListener,
    val isRootContext: Boolean)
  extends org.apache.spark.Logging with Serializable {

FYI

On Mon, Jan 18, 2016 at 1:44 PM, Giri P  wrote:

> yes I tried doing that but that doesn't work.
>
> I'm looking at using SQLContext and dataframes. Is SQLCOntext serializable?
>
> On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu  wrote:
>
>> Did you mean constructing SparkContext on the worker nodes ?
>>
>> Not sure whether that would work.
>>
>> Doesn't seem to be good practice.
>>
>> On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:
>>
>>> Can we use @transient ?
>>>
>>>
>>> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>>>
 I'm using spark cassandra connector to do this and the way we access
 cassandra table is

 sc.cassandraTable("keySpace", "tableName")

 Thanks
 Giri

 On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:

> Can you pass the properties which are needed for accessing Cassandra
> without going through SparkContext ?
>
> SparkContext isn't designed to be used in the way illustrated below.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>
>> Hi,
>>
>> I have a use case where I need to pass sparkcontext in map function
>>
>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>
>> Method1 needs spark context to query cassandra. But I see below error
>>
>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>
>> Is there a way we can fix this ?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>
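A minimal sketch of what that @transient marker buys you (the class and field
names below are made up): the wrapper is Serializable, closures that touch its
ordinary fields can be shipped, and the SparkContext field is simply skipped
during serialization instead of raising NotSerializableException. The
@transient field is null on the executors, so it must only be used on the
driver:

import org.apache.spark.SparkContext

class JobHelper(@transient val sc: SparkContext, suffix: String) extends Serializable {
  def tag(input: Seq[String]): Array[String] = {
    // the closure references `suffix`, so the JobHelper instance is serialized
    // with the task; `sc` is @transient and therefore left out
    sc.parallelize(input).map(word => word + suffix).collect()
  }
}

// driver side:
// val helper = new JobHelper(sc, "-tagged")
// helper.tag(Seq("a", "b")).foreach(println)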