Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread doctorx
Hi,
I am facing the same issue, with the following error:

ERROR Master:75 - Leadership has been revoked -- master shutting down.

Can anybody help? Any clue would be useful. Should I change something in the
Spark cluster or in ZooKeeper? Is there any setting in Spark that can help?

Thanks & Regards
Raghvendra
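
(For context: standalone-master HA with ZooKeeper is normally enabled through
the recovery properties below, passed via SPARK_DAEMON_JAVA_OPTS in
conf/spark-env.sh on every master. This is only a sketch; the ZooKeeper hosts
and the znode path are placeholders, not values taken from this thread.)

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"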



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Calling SparkContext methods in scala Future

2016-01-18 Thread Marco
Hello,

I am using Spark 1.5.1 within SBT, with Scala 2.10.6, and I am facing an
issue with the SparkContext.

Basically, I have an object that needs to do several things:

- call an external service One (web api)
- call an external service Two (another api)
- read and produce an RDD from HDFS (Spark)
- parallelize the data obtained in the first two calls
- join these different rdds, do stuff with them...

Now, I am trying to do it in an asynchronous way. This doesn't seem to
work, though. My guess is that Spark doesn't see the calls to .parallelize,
as they are made in different tasks (or Futures), so this code runs
earlier/later than expected, or maybe with an unset context (can that be?). I
have tried different approaches, one of them being a call to SparkEnv.set
inside the flatMap and map calls (in the Future). However, all I get is
"Cannot call methods on a stopped SparkContext". It just doesn't work - maybe
I just misunderstood what it does, so I removed it.

This is the code I have written so far:

object Fetcher {

  def fetch(name, master, ...) = {
val externalCallOne: Future[WSResponse] = externalService1()
val externalCallTwo: Future[String] = externalService2()
// val sparkEnv = SparkEnv.get
val config = new SparkConf()
.setAppName(name)
.set("spark.master", master)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sparkContext = new SparkContext(config)
//val sparkEnv = SparkEnv.get

val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
  // SparkEnv.set(sparkEnv)
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)
val rddTwo = sparkContext.parallelize(dataTwo)
// do stuff here ... foreach/println, and

val joinedData = rddOne leftOuterJoin (rddTwo)
  }
}
eventuallyJoinedData onSuccess { case success => ...  }
eventuallyJoinedData onFailure { case error =>
println(error.getMessage) }
// sparkContext.stop
  }

}
As you can see, I have also tried commenting out the line that stops the
context, but then I get another issue:

13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext
- Starting job: count at Fetcher.scala:38
13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
Selector.select() returned prematurely because
Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner
- Error in cleaning thread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
~[na:1.8.0_65]
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
[spark-core_2.10-1.5.1.jar:1.5.1]
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.940 [db-async-netty-thread-1] DEBUG
io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
because Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
~[na:1.8.0_65]
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
~[na:1.8.0_65]
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
~[na:1.8.0_65]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping org.spark-project.jetty.server.Server@787cbcef
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping SelectChannelConnector@0.0.0.0:4040
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping
org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
As you can see, it tries to call the count 

Re: SparkContext SyntaxError: invalid syntax

2016-01-18 Thread Andrew Weiner
Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.
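
(Illustration only: in conf/spark-defaults.conf that property takes the form
below; apart from PYSPARK_PYTHON, the variable names and paths are made-up
examples, not values from this thread.)

spark.yarn.appMasterEnv.PYSPARK_PYTHON     /opt/python/bin/python
spark.yarn.appMasterEnv.SOME_OTHER_VAR     some-value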

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung 
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> --
> From: andrewweiner2...@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutl...@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> 
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler  wrote:
>
> Glad you got it going!  It wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home//local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home//local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/etc/hadoop/conf')
>
> However, when I run in client mode, I see this:
>
> ('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')
>
> Furthermore, if I omit that environment variable from spark-env.sh 
> altogether, I get the expected error in both client and cluster mode:
>
> When running with master 'yarn'
> either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
>
> This suggests that my environment variables are being used when I first 
> submit the job, but at some point during the job, my environment variables 
> are thrown out and someone's (yarn's?) environment variables are being used.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Indeed!  Here is the output when I run in cluster mode:
>
> Traceback (most recent call last):
>   File "pi.py", line 22, in ?
> raise RuntimeError("\n"+str(sys.version_info) +"\n"+
> RuntimeError:
> (2, 4, 3, 'final', 

Calling SparkContext methods in scala Future

2016-01-18 Thread makronized
I am using Spark 1.5.1 within SBT, with Scala 2.10.6, and I am writing because
I am facing an issue with the SparkContext.

Basically, I have an object that needs to do several things:

- call an external service One (web api)
- call an external service Two (another api)
- read and produce an RDD from HDFS (Spark)
- parallelize the data obtained in the first two calls
- join these different rdds, do stuff with them...

Now, I am trying to do it in an asynchronous way. This doesn't seem to work,
though. My guess is that Spark doesn't see the calls to .parallelize, as they
are made in different tasks (or Futures), so this code runs earlier/later than
expected, or maybe with an unset context (can that be?). I have tried
different approaches, one of them being a call to SparkEnv.set inside the
flatMap and map calls (in the Future). However, all I get is "Cannot call
methods on a stopped SparkContext". It just doesn't work - maybe I just
misunderstood what it does, so I removed it.

This is the code I have written so far:

object Fetcher { 

  def fetch(name, master, ...) = { 
val externalCallOne: Future[WSResponse] = externalService1()
val externalCallTwo: Future[String] = externalService2()
// val sparkEnv = SparkEnv.get
val config = new SparkConf()
.setAppName(name)
.set("spark.master", master)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sparkContext = new SparkContext(config)
//val sparkEnv = SparkEnv.get

val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
  // SparkEnv.set(sparkEnv)
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)
val rddTwo = sparkContext.parallelize(dataTwo)
// do stuff here ... foreach/println, and 

val joinedData = rddOne leftOuterJoin (rddTwo)
  }
} 
eventuallyJoinedData onSuccess { case success => ...  }
eventuallyJoinedData onFailure { case error => println(error.getMessage)
} 
// sparkContext.stop 
  } 

}
As you can see, I have also tried commenting out the line that stops the
context, but then I get another issue:

13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext -
Starting job: count at Fetcher.scala:38
13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
Selector.select() returned prematurely because
Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner -
Error in cleaning thread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
~[na:1.8.0_65]
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.940 [db-async-netty-thread-1] DEBUG
io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
because Thread.currentThread().interrupt() was called. Use
NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils - uncaught
error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
~[na:1.8.0_65]
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
~[na:1.8.0_65]
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
~[na:1.8.0_65]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
~[spark-core_2.10-1.5.1.jar:1.5.1]
at
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[spark-core_2.10-1.5.1.jar:1.5.1]
13:09:14.949 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping org.spark-project.jetty.server.Server@787cbcef
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping SelectChannelConnector@0.0.0.0:4040
13:09:14.959 [SparkListenerBus] DEBUG o.s.j.u.component.AbstractLifeCycle -
stopping
org.spark-project.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@797cc465
As you can see, it 

Re: Using JDBC clients with "Spark on Hive"

2016-01-18 Thread Ricardo Paiva
Are you running the Spark Thrift JDBC/ODBC server?

In my environment I have a Hive Metastore server and the Spark Thrift
Server pointing to the Hive Metastore.

I use the Hive beeline tool for testing. With this setup I'm able to use
Tableau connecting to Hive tables and using Spark SQL as the engine.
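
(For anyone reproducing this kind of setup, a minimal sketch looks like the
commands below; the master, host and port are placeholders, and the Thrift
server typically picks up the Hive metastore from the hive-site.xml on its
classpath.)

  ./sbin/start-thriftserver.sh --master yarn \
    --hiveconf hive.server2.thrift.port=10000
  beeline -u jdbc:hive2://thrift-server-host:10000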

Regards,

Ricardo


On Thu, Jan 14, 2016 at 11:15 PM, sdevashis [via Apache Spark User List] <
ml-node+s1001560n25976...@n3.nabble.com> wrote:

> Hello Experts,
>
> I am getting started with Hive with Spark as the query engine. I built the
> package from sources. I am able to invoke Hive CLI and run queries and see
> in Ambari that Spark application are being created confirming hive is using
> Spark as the engine.
>
> However other than Hive CLI, I am not able to run queries from any other
> clients that use the JDBC to connect to hive through thrift. I tried
> Squirrel, Aginity Netezza workbench, and even Hue.
>
> No yarn applications are getting created, the query times out after
> sometime. Nothing gets into /tmp/user/hive.log. Am I missing something?
>
> Again I am using Hive on Spark and not spark SQL.
>
> Version Info:
> Spark 1.4.1 built for Hadoop 2.4
>
>
> Thank you in advance for any pointers.
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-JDBC-clients-with-Spark-on-Hive-tp25976.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-JDBC-clients-with-Spark-on-Hive-tp25976p25988.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming: BatchDuration and Processing time

2016-01-18 Thread Ricardo Paiva
If you are using Kafka as the message queue, Spark will still process the
time slices accordingly, even if it is running late, as in your example. But
it will eventually fail, because your process will end up asking for a
message that is older than the oldest message retained in Kafka.
If your processing takes longer than the batch interval, say at your system's
peak time during the day, but takes much less time at night when the system
is mostly idle, the streaming job will keep working and process everything
correctly (though it's risky if the late time slices don't finish during the
idle time).

The best thing to do is to optimize your job so that it fits within the batch
interval and avoid falling behind. :)
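
(A small sketch of the knobs involved, assuming a 1-second batch interval;
spark.streaming.backpressure.enabled is available from Spark 1.5 and simply
throttles the ingestion rate when batches start to fall behind, it does not
change the semantics described above.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-example")
  // slow down intake when processing time exceeds the batch interval
  .set("spark.streaming.backpressure.enabled", "true")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batch duration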

Regards,

Ricardo





On Sun, Jan 17, 2016 at 2:32 PM, pyspark2555 [via Apache Spark User List] <
ml-node+s1001560n25986...@n3.nabble.com> wrote:

> Hi,
>
> If BatchDuration is set to 1 second in StreamingContext and the actual
> processing time is longer than one second, then how does Spark handle that?
>
> For example, I am receiving a continuous Input stream. Every 1 second
> (batch duration), the RDDs will be processed. What if this processing time
> is longer than 1 second? What happens in the next batch duration?
>
> Thanks.
> Amit
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-BatchDuration-and-Processing-time-tp25986.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-BatchDuration-and-Processing-time-tp25986p25989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark streaming: Fixed time aggregation & handling driver failures

2016-01-18 Thread Ricardo Paiva
I don't know if this is the most efficient way to do it, but you can use
a sliding window that is bigger than your aggregation period and filter
only the messages that fall inside the period.

Remember that to work with reduceByKeyAndWindow you need to associate
each row with a time key, in your case "MMddhhmm".

http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
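
(A rough sketch of that idea; the input DStream, the key format and the
window sizes below are assumptions for illustration, not code from this
thread.)

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Minutes

// assumed input: events: DStream[(String, Long)] of (payload, epoch millis)
val keyed = events.map { case (payload, ts) =>
  // bucket every record by its wall-clock minute, e.g. "201601180505"
  (new SimpleDateFormat("yyyyMMddHHmm").format(new Date(ts)), 1L)
}
// window larger than the 5-minute aggregation period, sliding every 5 minutes;
// downstream, keep only the buckets whose key falls inside the period you want
val counts = keyed.reduceByKeyAndWindow((a: Long, b: Long) => a + b,
  Minutes(10), Minutes(5))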

Hope it helps,

Regards,

Ricardo



On Sat, Jan 16, 2016 at 1:13 AM, ffarozan [via Apache Spark User List] <
ml-node+s1001560n25982...@n3.nabble.com> wrote:

> I am implementing aggregation using spark streaming and kafka. My batch
> and window size are same. And the aggregated data is persisted in
> Cassandra.
>
> I want to aggregate for fixed time windows - 5:00, 5:05, 5:10, ...
>
> But we cannot control when to run streaming job, we only get to specify
> the batch interval.
>
> So the problem is - lets say if streaming job starts at 5:02, then I will
> get results at 5:07, 5:12, etc. and not what I want.
>
> Any suggestions?
>
> thanks,
> Firdousi
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here
> 
> .
> NAML
> 
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-Fixed-time-aggregation-handling-driver-failures-tp25982p25990.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
Hi,

How do you know it doesn't work? The log looks roughly normal to me. Is
Spark not running at the printed address? Can you not start jobs?

On Mon, Jan 18, 2016 at 11:51 AM, Oleg Ruchovets 
wrote:

> Hi ,
>    I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.
>
> It doesn't work properly - I got exceptions, and at the end a standalone Spark
> cluster was installed.
>

The purpose of the script is to install a standalone Spark cluster. So
that's not an error :).


> here is log information:
>
> Any suggestions?
>
> Thanks
> Oleg.
>
> oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
> --key-pair=CC-ES-Demo
>  
> --identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
> --region=us-east-1 --zone=us-east-1a --spot-price=0.05   -s 5
> --spark-version=1.6.0   launch entity-extraction-spark-cluster
> Setting up security groups...
> Searching for existing cluster entity-extraction-spark-cluster in region
> us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Requesting 5 slaves as spot instances with price $0.050
> Waiting for spot instances to be granted...
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> All 5 slaves granted
> Launched master in us-east-1a, regid = r-9384033f
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state..
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
> Cluster is now in 'ssh-ready' state. Waited 442 seconds.
> Generating cluster's SSH key on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Transferring cluster's SSH key to slaves...
> ec2-54-165-243-74.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
> (ECDSA) to the list of known hosts.
> ec2-54-88-245-107.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
> (ECDSA) to the list of known hosts.
> ec2-54-172-29-47.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
> (ECDSA) to the list of known hosts.
> ec2-54-165-131-210.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
> (ECDSA) to the list of known hosts.
> ec2-54-172-46-184.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
> (ECDSA) to the list of known hosts.
> Cloning spark-ec2 scripts from
> https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Cloning into 'spark-ec2'...
> remote: Counting objects: 2068, done.
> remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
> Receiving objects: 100% (2068/2068), 349.76 KiB, done.
> Resolving deltas: 100% (796/796), done.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Deploying files to master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> sending incremental file list
> root/spark-ec2/ec2-variables.sh
>
> sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
> total size is 1,684  speedup is 0.90
> Running setup on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Setting up Spark on 

Re: Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread Ted Yu
Please refer to the following suites:

yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
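
(Hedged side note: inside Spark's own test suites a multi-JVM "mini cluster"
is usually obtained with the local-cluster master URL, roughly as below. It is
a test facility rather than a documented public API, and it needs a local
Spark build to launch the worker JVMs, so treat this only as a sketch.)

import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
val sc = new SparkContext(
  new SparkConf().setMaster("local-cluster[2,1,1024]").setAppName("mini-cluster-test"))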

Cheers

On Mon, Jan 18, 2016 at 2:14 AM, zml张明磊  wrote:

> Hello,
>
>
>
>I want to find some test files in Spark that provide the same kind of
> functionality as the Hadoop MiniCluster test environment, but I cannot
> find them. Does anyone know about that?
>


Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread Ted Yu
Can you pastebin master log before the error showed up ?

The initial message was posted for Spark 1.2.0.
Which release of Spark / ZooKeeper do you use?

Thanks

On Mon, Jan 18, 2016 at 6:47 AM, doctorx  wrote:

> Hi,
> I am facing the same issue, with the following error:
>
> ERROR Master:75 - Leadership has been revoked -- master shutting down.
>
> Can anybody help? Any clue would be useful. Should I change something in the
> Spark cluster or in ZooKeeper? Is there any setting in Spark that can help
> me?
>
> Thanks & Regards
> Raghvendra
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p25994.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark ml Dataframe vs Labeled Point RDD Mllib speed

2016-01-18 Thread jarias
Hi all,

I've recently been playing with the ml API in Spark 1.6.0, as I'm in the
process of implementing a series of new classifiers for my PhD thesis. There
are some questions that have arisen regarding the scalability of the
different data pipelines that can be used to load the training datasets into
the ML algorithms.

I used to load the data into spark by creating a raw RDD from a text file
(csv data points) and parse them directly into LabeledPoints to be used with
the learning algorithms.

Right now, I see that the spark.ml API is built on top of DataFrames, so
I've changed my scripts to load data into this data structure using the
Databricks CSV connector. However, this method seems to be 2 to 4 times
slower than the original one. I can understand that creating a DataFrame can
be costly for the system, even when using a pre-specified schema. In
addition, it seems that working with the "Row" class is also slower than
accessing the data inside LabeledPoints.

My last point is that all the proposed algorithms seem to work with
DataFrames in which the rows are composed of  Rows, which implies an
additional parsing pass through the data when loading from a text file.
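
(To make the comparison concrete, a sketch of the two loading paths being
contrasted, as run from a spark-shell session; the file name, schema and
column layout are assumptions, and the DataFrame path relies on the external
spark-csv package mentioned above.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// RDD path: parse each CSV line straight into a LabeledPoint (label first)
val labeled = sc.textFile("train.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}

// DataFrame path: load through the Databricks CSV connector with a
// pre-specified schema (two features here, purely as an example)
val trainSchema = StructType(Seq(
  StructField("label", DoubleType),
  StructField("f0", DoubleType),
  StructField("f1", DoubleType)))
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .schema(trainSchema)
  .load("train.csv")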

I'm sorry if I'm just mixing up some concepts from the documentation, but
after intensive experimentation I don't really see a clear strategy for
using these different elements. Any thoughts would be really appreciated :)

Cheers,

jarias



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ml-Dataframe-vs-Labeled-Point-RDD-Mllib-speed-tp25995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

2016-01-18 Thread Siddharth Ubale
Hi,

Thanks for pointing out the Phoenix discrepancy! I was using a Phoenix 4.4 jar
built for the HBase 1.1 release, whereas I was actually running HBase 0.98. I
have fixed that issue.
However, I am still unable to get the streaming job going in cluster mode; it
fails with the following trace:

Application application_1452763526769_0021 failed 2 times due to AM Container 
for appattempt_1452763526769_0021_02 exited with exitCode: -1000
For more detailed output, check application tracking 
page:http://slave1:8088/proxy/application_1452763526769_0021/Then, click on 
links to logs of each attempt.
Diagnostics: File does not exist: 
hdfs://slave1:9000/user/hduser/.sparkStaging/application_1452763526769_0021/__hadoop_conf__7080838197861423764.zip
java.io.FileNotFoundException: File does not exist: 
hdfs://slave1:9000/user/hduser/.sparkStaging/application_1452763526769_0021/__hadoop_conf__7080838197861423764.zip
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Failing this attempt. Failing the application.


I am not sure what this _Hadoop_conf-XXX.zip file is. Can someone please guide me?
I am submitting a Spark Streaming job which reads from a Kafka topic and dumps
data into HBase tables via the Phoenix API. The job behaves as expected in local
mode.


Thanks
Siddharth Ubale

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, January 15, 2016 8:08 PM
To: Siddharth Ubale 
Cc: user@spark.apache.org
Subject: Re: Spark App -Yarn-Cluster-Mode ===> Hadoop_conf_**.zip file.

Interesting. Which hbase / Phoenix releases are you using ?
The following method has been removed from Put:

   public Put setWriteToWAL(boolean write) {

Please make sure the Phoenix release is compatible with your hbase version.

Cheers

On Fri, Jan 15, 2016 at 6:20 AM, Siddharth Ubale 
> wrote:
Hi,


This is the log from the application :

16/01/15 19:23:19 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster 
with SUCCEEDED (diag message: Shutdown hook called before final status was 
reported.)
16/01/15 19:23:19 INFO yarn.ApplicationMaster: Deleting staging directory 
.sparkStaging/application_1452763526769_0011
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Shutting down remote daemon.
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote 
daemon shut down; proceeding with flushing remote transports.
16/01/15 19:23:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
Remoting shut down.
16/01/15 19:23:19 INFO util.Utils: Shutdown hook called
16/01/15 19:23:19 INFO client.HConnectionManager$HConnectionImplementation: 
Closing zookeeper sessionid=0x1523f753f6f0061
16/01/15 19:23:19 INFO zookeeper.ClientCnxn: EventThread shut down
16/01/15 19:23:19 INFO zookeeper.ZooKeeper: Session: 0x1523f753f6f0061 closed
16/01/15 19:23:19 ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.NoSuchMethodError: 
org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
java.lang.NoSuchMethodError: 
org.apache.hadoop.hbase.client.Put.setWriteToWAL(Z)Lorg/apache/hadoop/hbase/client/Put;
at 
org.apache.phoenix.schema.PTableImpl$PRowImpl.newMutations(PTableImpl.java:639)
at 
org.apache.phoenix.schema.PTableImpl$PRowImpl.<init>(PTableImpl.java:632)
at 
org.apache.phoenix.schema.PTableImpl.newRow(PTableImpl.java:557)
at 
org.apache.phoenix.schema.PTableImpl.newRow(PTableImpl.java:573)
at 
org.apache.phoenix.execute.MutationState.addRowMutations(MutationState.java:185)
at 

Re: Error in Spark Executors when trying to read HBase table from Spark with Kerberos enabled

2016-01-18 Thread Vinay Kashyap
Hi Guys,

Any help regarding this issue..??



On Wed, Jan 13, 2016 at 6:39 PM, Vinay Kashyap  wrote:

> Hi all,
>
> I am using  *Spark 1.5.1 in YARN cluster mode in CDH 5.5.*
> I am trying to create an RDD by reading HBase table with kerberos enabled.
> I am able to launch the spark job to read the HBase table, but I notice
> that the executors launched for the job cannot proceed due to an issue with
> Kerberos and they are stuck indefinitely.
>
> Below is my code to read a HBase table.
>
>
> Configuration configuration = HBaseConfiguration.create();
>   configuration.set(TableInputFormat.INPUT_TABLE,
> frameStorage.getHbaseStorage().getTableId());
>   String hbaseKerberosUser = "sparkUser";
>   String hbaseKerberosKeytab = "";
>   if (!hbaseKerberosUser.trim().isEmpty() &&
> !hbaseKerberosKeytab.trim().isEmpty()) {
> configuration.set("hadoop.security.authentication", "kerberos");
> configuration.set("hbase.security.authentication", "kerberos");
> configuration.set("hbase.security.authorization", "true");
> configuration.set("hbase.rpc.protection", "authentication");
> configuration.set("hbase.master.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.regionserver.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.rest.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.thrift.kerberos.principal",
> "hbase/_HOST@CERT.LOCAL");
> configuration.set("hbase.master.keytab.file",
> hbaseKerberosKeytab);
> configuration.set("hbase.regionserver.keytab.file",
> hbaseKerberosKeytab);
> configuration.set("hbase.rest.authentication.kerberos.keytab",
> hbaseKerberosKeytab);
> configuration.set("hbase.thrift.keytab.file",
> hbaseKerberosKeytab);
> UserGroupInformation.setConfiguration(configuration);
> if (UserGroupInformation.isSecurityEnabled()) {
>   UserGroupInformation ugi = UserGroupInformation
>   .loginUserFromKeytabAndReturnUGI(hbaseKerberosUser,
> hbaseKerberosKeytab);
>   TokenUtil.obtainAndCacheToken(configuration, ugi);
> }
>   }
>
>   System.out.println("loading HBase Table RDD ...");
>   JavaPairRDD hbaseTableRDD =
> this.sparkContext.newAPIHadoopRDD(
>   configuration, TableInputFormat.class,
> ImmutableBytesWritable.class, Result.class);
>   JavaRDD tableRDD = getTableRDD(hbaseTableRDD, dataFrameModel);
>   System.out.println("Count :: " + tableRDD.count());
> Following is the error which I can see in the container logs
>
> 16/01/13 10:01:42 WARN security.UserGroupInformation:
> PriviledgedActionException as:sparkUser (auth:SIMPLE)
> cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
> 16/01/13 10:01:42 WARN ipc.RpcClient: Exception encountered while
> connecting to the server : javax.security.sasl.SaslException: GSS initiate
> failed [Caused by GSSException: No valid credentials provided (Mechanism
> level: Failed to find any Kerberos tgt)]
> 16/01/13 10:01:42 ERROR ipc.RpcClient: SASL authentication failed. The
> most likely cause is missing or invalid credentials. Consider 'kinit'.
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
>  at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>  at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:770)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$600(RpcClient.java:357)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection$2.run(RpcClient.java:891)
>  at
> org.apache.hadoop.hbase.ipc.RpcClient$Connection$2.run(RpcClient.java:888)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:415)
>
> Have valid Kerberos Token as can be seen below:
>
> sparkUser@infra:/ebs1/agent$ klist
> Ticket cache: FILE:/tmp/krb5cc_1001
> Default principal: sparkUser@CERT.LOCAL
>
> Valid startingExpires   Service principal
> 13/01/2016 12:07  14/01/2016 12:07  krbtgt/CERT.LOCAL@CERT.LOCAL
>
> Also, I confirmed that only reading from HBase is giving this problem.
> Because I can read a simple file in HDFS and I am able to create the RDD as
> required.
> After digging through some contents in the net, found that there is a
> ticket in JIRA which is logged which is similar to what I am experiencing
> *https://issues.apache.org/jira/browse/SPARK-12279
> 

spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
Hi ,
   I tried to follow the Spark 1.6.0 instructions to install Spark on EC2.

It doesn't work properly - I got exceptions, and at the end a standalone Spark
cluster was installed.
here is log information:

Any suggestions?

Thanks
Oleg.

oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
--key-pair=CC-ES-Demo
 
--identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
--region=us-east-1 --zone=us-east-1a --spot-price=0.05   -s 5
--spark-version=1.6.0   launch entity-extraction-spark-cluster
Setting up security groups...
Searching for existing cluster entity-extraction-spark-cluster in region
us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Requesting 5 slaves as spot instances with price $0.050
Waiting for spot instances to be granted...
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
0 of 5 slaves granted, waiting longer
All 5 slaves granted
Launched master in us-east-1a, regid = r-9384033f
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state..

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-52-90-186-83.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
port 22: Connection refused

.
Cluster is now in 'ssh-ready' state. Waited 442 seconds.
Generating cluster's SSH key on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Transferring cluster's SSH key to slaves...
ec2-54-165-243-74.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
(ECDSA) to the list of known hosts.
ec2-54-88-245-107.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
(ECDSA) to the list of known hosts.
ec2-54-172-29-47.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
(ECDSA) to the list of known hosts.
ec2-54-165-131-210.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
(ECDSA) to the list of known hosts.
ec2-54-172-46-184.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
(ECDSA) to the list of known hosts.
Cloning spark-ec2 scripts from
https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Cloning into 'spark-ec2'...
remote: Counting objects: 2068, done.
remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
Receiving objects: 100% (2068/2068), 349.76 KiB, done.
Resolving deltas: 100% (796/796), done.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Deploying files to master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
sending incremental file list
root/spark-ec2/ec2-variables.sh

sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
total size is 1,684  speedup is 0.90
Running setup on master...
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
Warning: Permanently added
'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
(ECDSA) to the list of known hosts.
Setting up Spark on ip-172-31-24-124.ec2.internal...
Setting executable permissions on scripts...
RSYNC'ing /root/spark-ec2 to other cluster nodes...
ec2-54-165-243-74.compute-1.amazonaws.com
Warning: Permanently added
'ec2-54-165-243-74.compute-1.amazonaws.com,172.31.19.61'
(ECDSA) to the list of known hosts.
ec2-54-88-245-107.compute-1.amazonaws.com
id_rsa

 100% 1679 1.6KB/s   00:00
Warning: Permanently added
'ec2-54-88-245-107.compute-1.amazonaws.com,172.31.30.81'
(ECDSA) to the list of known hosts.
ec2-54-172-29-47.compute-1.amazonaws.com
id_rsa

 100% 1679 1.6KB/s   00:00

Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread zml张明磊
Hello,

   I want to find some test files in Spark that provide the same kind of
functionality as the Hadoop MiniCluster test environment, but I cannot find
them. Does anyone know about that?


Re: Spark Streaming on mesos

2016-01-18 Thread Iulian Dragoș
On Mon, Nov 30, 2015 at 4:09 PM, Renjie Liu  wrote:

> Hi, Lulian:
>

Please, it's Iulian, not Lulian.


> Are you sure that it'll be a long running process in fine-grained mode? I
> think you have a misunderstanding about it. An executor will be launched
> for some tasks, but not a long running process. When a group of tasks
> finished, it will get shutdown.
>

Sorry I missed your answer. Yes, I'm pretty sure, and if you have SSH
access to one of the slaves it's pretty easy to check. What makes you think
otherwise?


>
> On Mon, Nov 30, 2015 at 6:25 PM Iulian Dragoș 
> wrote:
>
>> Hi,
>>
>> Latency isn't such a big issue as it sounds. Did you try it out and
>> failed some performance metrics?
>>
>> In short, the *Mesos* executor on a given slave is going to be
>> long-running (consuming memory, but no CPUs). Each Spark task will be
>> scheduled using Mesos CPU resources, but they don't suffer much latency.
>>
>> iulian
>>
>>
>> On Mon, Nov 30, 2015 at 4:17 AM, Renjie Liu 
>> wrote:
>>
>>> Hi, Tim:
>>> Fine grain mode is not suitable for streaming applications since it need
>>> to start up an executor each time. When will the revamp get release? In the
>>> coming 1.6.0?
>>>
>>> On Sun, Nov 29, 2015 at 6:16 PM Timothy Chen  wrote:
>>>
 Hi Renjie,

 You can set number of cores per executor with spark executor cores in
 fine grain mode.

 If you want coarse grain mode to support that it will
 Be supported in the near term as he coarse grain scheduler is getting
 revamped now.

 Tim

 On Nov 28, 2015, at 7:31 PM, Renjie Liu 
 wrote:

 Hi, Nagaraj:
  Thanks for the response, but this does not solve my problem.
 I think executor memory should be proportional to number of cores, or
 number of core
 in each executor should be the same.
 On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar <
 nchandrashe...@innominds.com> wrote:

> Hi Renjie,
>
> I have not setup Spark Streaming on Mesos but there is something
> called reservations in Mesos.  It supports both Static and Dynamic
> reservations.  Both types of reservations must have role defined. You may
> want to explore these options.   Excerpts from the Apache Mesos
> documentation.
>
> Cheers
> Nagaraj C
> Reservation
>
> Mesos provides mechanisms to reserve resources in specific slaves.
> The concept was first introduced with static reservation in 0.14.0
> which enabled operators to specify the reserved resources on slave 
> startup.
> This was extended with dynamic reservation in 0.23.0 which enabled
> operators and authorized frameworks to dynamically reserve resources
> in the cluster.
>
> No breaking changes were introduced with dynamic reservation, which
> means the existing static reservation mechanism continues to be fully
> supported.
>
> In both types of reservations, resources are reserved for a role.
> Static Reservation (since 0.14.0)
>
> An operator can configure a slave with resources reserved for a role.
> The reserved resources are specified via the --resources flag. For
> example, suppose we have 12 CPUs and 6144 MB of RAM available on a slave
> and that we want to reserve 8 CPUs and 4096 MB of RAM for the ads role.
> We start the slave like so:
>
> $ mesos-slave \
>   --master=: \
>   --resources="cpus:4;mem:2048;cpus(ads):8;mem(ads):4096"
>
> We now have 8 CPUs and 4096 MB of RAM reserved for ads on this slave.
>
>
> From: Renjie Liu 
> Date: Friday, November 27, 2015 at 9:57 PM
> To: "user@spark.apache.org" 
> Subject: Spark Streaming on mesos
>
> Hi, all:
> I'm trying to run spark streaming on mesos and it seems that none of
> the scheduler is suitable for that. Fine grain scheduler will start an
> executor for each task so it will significantly increase the latency. 
> While
> coarse grained mode can only set the max core numbers and executor memory
> but there's no way to set the number of cores for each executor. Has 
> anyone
> deployed spark streaming on mesos? And what's your settings?
> --
> Liu, Renjie
> Software Engineer, MVAD
>
 --
 Liu, Renjie
 Software Engineer, MVAD

 --
>>> Liu, Renjie
>>> Software Engineer, MVAD
>>>
>>
>>
>>
>> --
>>
>> --
>> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com
>>
>> --
> Liu, Renjie
> Software Engineer, MVAD
>



-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Extracting p values in Logistic regression using mllib scala

2016-01-18 Thread Chandan Verma
Hi,

 

Can anyone help me extract p-values from a logistic regression model
using MLlib and Scala?

 

Thanks

Chandan Verma




===
DISCLAIMER:
The information contained in this message (including any attachments) is 
confidential and may be privileged. If you have received it by mistake please 
notify the sender by return e-mail and permanently delete this message and any 
attachments from your system. Any dissemination, use, review, distribution, 
printing or copying of this message in whole or in part is strictly prohibited. 
Please note that e-mails are susceptible to change. CitiusTech shall not be 
liable for the improper or incomplete transmission of the information contained 
in this communication nor for any delay in its receipt or damage to your 
system. CitiusTech does not guarantee that the integrity of this communication 
has been maintained or that this communication is free of viruses, 
interceptions or interferences. 



[Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Hi spark users and developers,

What do you do if you want the from_unixtime function in Spark SQL to
return the time in a timezone you specify instead of the system timezone?

Best Regards,

Jerry
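
(Not an answer from this thread, just a commonly suggested workaround,
sketched under assumptions: instead of relying on from_unixtime and the JVM
default timezone, cast the epoch seconds to a timestamp and shift it
explicitly with from_utc_timestamp. The column name and zone ID below are
examples only.)

import org.apache.spark.sql.functions._

val withLocalTime = df.withColumn(
  "event_time_ny",
  from_utc_timestamp(col("epoch_seconds").cast("timestamp"), "America/New_York"))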


Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent... on
local mode I never paid attention, but the code path should be similar...
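
(For concreteness, a minimal sketch of firing two independent actions on the
same cached RDD from separate threads via Futures; the input path is a
placeholder.)

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd = sc.textFile("hdfs:///data/events").cache()  // cache so both actions reuse it

val total    = Future { rdd.count() }
val distinct = Future { rdd.distinct().count() }

// both jobs are submitted concurrently; with the FAIR scheduler they can share resources
val Seq(n, d) = Await.result(Future.sequence(Seq(total, distinct)), Duration.Inf)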
On Jan 18, 2016 8:00 AM, "Koert Kuipers"  wrote:

> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom 
> wrote:
>
>> Hi,
>>
>> I am running my app in a single machine first before moving it in the
>> cluster; actually simultaneous actions are not working for me now; is this
>> coming from the fact that I am using a single machine? Yet I am using
>> FAIR scheduler.
>>
>> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>>
>>> It can be far more than that (e.g.
>>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>>> either unrecognized or a greatly under-appreciated and underused feature of
>>> Spark.
>>>
>>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>>> wrote:
>>>
 the re-use of shuffle files is always a nice surprise to me

 On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
 wrote:

> Same SparkContext means same pool of Workers.  It's up to the
> Scheduler, not the SparkContext, whether the exact same Workers or
> Executors will be used to calculate simultaneous actions against the same
> RDD.  It is likely that many of the same Workers and Executors will be 
> used
> as the Scheduler tries to preserve data locality, but that is not
> guaranteed.  In fact, what is most likely to happen is that the shared
> Stages and Tasks being calculated for the simultaneous actions will not
> actually be run at exactly the same time, which means that shuffle files
> produced for one action will be reused by the other(s), and repeated
> calculations will be avoided even without explicitly caching/persisting 
> the
> RDD.
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
> wrote:
>
>> Same rdd means same sparkcontext means same workers
>>
>> Cache/persist the rdd to avoid repeated jobs
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" 
>> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers,
>>>
>>> If I correctly understand, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME rdd, (which is logical 
>>> because
>>> they are read only object). however, I want to know if the same workers 
>>> are
>>> used for the concurrent analysis ?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>>
 I stand corrected. How considerable are the benefits though? Will
 the scheduler be able to dispatch jobs from both actions 
 simultaneously (or
 on a when-workers-become-available basis)?

 On 15 January 2016 at 11:44, Koert Kuipers 
 wrote:

> we run multiple actions on the same (cached) rdd all the time, i
> guess in different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use
>> them this way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
>> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions
>> in parallel? The idea behind RDDs is to provide you with an 
>> abstraction for
>> computing parallel operations on distributed data. Even if you were 
>> to call
>> actions from several threads at once, the individual executors of 
>> your
>> spark environment would still have to perform operations 
>> sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single 
>> operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney > > wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira 
>>> escribió:
>>>
 Hi,

 Can we run *simultaneous* actions on the *same RDD* ?; if yes
 how can this
 be done ?

 Thank you,
 Regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com 

Spark SQL create table

2016-01-18 Thread raghukiran
Is creating a table using the SparkSQLContext currently supported?

Regards,
Raghu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Ted Yu
  externalCallTwo map { dataTwo =>
println("in map") // prints, so it gets here ...
val rddOne = sparkContext.parallelize(dataOne)

I don't think you should call methods on the sparkContext inside the map
function; the sparkContext lives on the driver side.
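
(A sketch of the restructuring that advice implies, assuming for simplicity
that the two services return plain Scala collections of pairs: block on the
Futures first, then touch the SparkContext only from the driver thread, and
stop it only after the last action. The timeout and the element types are
assumptions, not taken from the original code.)

import scala.concurrent.Await
import scala.concurrent.duration._

// assumed: externalCallOne, externalCallTwo: Future[Seq[(String, Int)]]
val dataOne = Await.result(externalCallOne, 30.seconds)
val dataTwo = Await.result(externalCallTwo, 30.seconds)

val rddOne = sparkContext.parallelize(dataOne)  // driver-side, outside any Future
val rddTwo = sparkContext.parallelize(dataTwo)
val joined = rddOne.leftOuterJoin(rddTwo)

joined.foreach(println)  // run the action...
sparkContext.stop()      // ...and only stop the context once the work is done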

Cheers

On Mon, Jan 18, 2016 at 6:27 AM, Marco  wrote:

> Hello,
>
> I am using Spark 1.5.1 within SBT, and Scala 2.10.6 and I am facing an
> issue with the SparkContext.
>
> Basically, I have an object that needs to do several things:
>
> - call an external service One (web api)
> - call an external service Two (another api)
> - read and produce an RDD from HDFS (Spark)
> - parallelize the data obtained in the first two calls
> - join these different rdds, do stuff with them...
>
> Now, I am trying to do it in an asynchronous way. This doesn't seem to
> work, though. My guess is that Spark doesn't see the calls to .parallelize,
> as they are made in different tasks (or Future, therefore this code is
> called before/later or maybe with an unset Context (can it be?)). I have
> tried different ways, one of these being the call to SparkEnv.set in the
> calls to flatMap and map (in the Future). However, all I get is Cannot call
> methods on a stopped SparkContext. It just doesnt'work - maybe I just
> misunderstood what it does, therefore I removed it.
>
> This is the code I have written so far:
>
> object Fetcher {
>
>   def fetch(name, master, ...) = {
> val externalCallOne: Future[WSResponse] = externalService1()
> val externalCallTwo: Future[String] = externalService2()
> // val sparkEnv = SparkEnv.get
> val config = new SparkConf()
> .setAppName(name)
> .set("spark.master", master)
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>
> val sparkContext = new SparkContext(config)
> //val sparkEnv = SparkEnv.get
>
> val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
>   // SparkEnv.set(sparkEnv)
>   externalCallTwo map { dataTwo =>
> println("in map") // prints, so it gets here ...
> val rddOne = sparkContext.parallelize(dataOne)
> val rddTwo = sparkContext.parallelize(dataTwo)
> // do stuff here ... foreach/println, and
>
> val joinedData = rddOne leftOuterJoin (rddTwo)
>   }
> }
> eventuallyJoinedData onSuccess { case success => ...  }
> eventuallyJoinedData onFailure { case error =>
> println(error.getMessage) }
> // sparkContext.stop
>   }
>
> }
> As you can see, I have also tried to comment the line to stop the context,
> but then I get another issue:
>
> 13:09:14.929 [ForkJoinPool-1-worker-5] INFO  org.apache.spark.SparkContext
> - Starting job: count at Fetcher.scala:38
> 13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
> Selector.select() returned prematurely because
> Thread.currentThread().interrupt() was called. Use
> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
> 13:09:14.936 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner
> - Error in cleaning thread
> java.lang.InterruptedException: null
> at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
> ~[na:1.8.0_65]
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> at 
> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
> [spark-core_2.10-1.5.1.jar:1.5.1]
> 13:09:14.940 [db-async-netty-thread-1] DEBUG
> io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
> because Thread.currentThread().interrupt() was called. Use
> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
> 13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
> uncaught error in thread SparkListenerBus, stopping SparkContext
> java.lang.InterruptedException: null
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
> ~[na:1.8.0_65]
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> ~[na:1.8.0_65]
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
> ~[na:1.8.0_65]
> at
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:65)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> ~[spark-core_2.10-1.5.1.jar:1.5.1]
> at
> 

Re: Serializing DataSets

2016-01-18 Thread Michael Armbrust
What error?

On Mon, Jan 18, 2016 at 9:01 AM, Simon Hafner  wrote:

> And for deserializing,
> `sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
> error?
>
> 2016-01-14 3:43 GMT+08:00 Michael Armbrust :
> > Yeah, thats the best way for now (note the conversion is purely logical
> so
> > there is no cost of calling toDF()).  We'll likely be combining the
> classes
> > in Spark 2.0 to remove this awkwardness.
> >
> > On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner 
> > wrote:
> >>
> >> What's the proper way to write DataSets to disk? Convert them to a
> >> DataFrame and use the writers there?
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Daniel Darabos
On Mon, Jan 18, 2016 at 5:24 PM, Oleg Ruchovets 
wrote:

> I thought script tries to install hadoop / hdfs also. And it looks like it
> failed. Installation is only standalone spark without hadoop. Is it correct
> behaviour?
>

Yes, it also sets up two HDFS clusters. Are they not working? Try to see if
Spark is working by running some simple jobs on it. (See
http://spark.apache.org/docs/latest/ec2-scripts.html.)

There is no program called Hadoop. If you mean YARN, then indeed the script
does not set up YARN. It sets up standalone Spark.
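
For example, a minimal smoke test from spark-shell on the master could look like
the sketch below. It uses only parallelized data, so it runs even if you suspect
HDFS is broken (the numbers are arbitrary):

val data = sc.parallelize(1 to 1000)
println(data.map(_ * 2).sum())   // runs a real job on the cluster without touching HDFS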


> Also errors in the log:
>ERROR: Unknown Tachyon version
>Error: Could not find or load main class crayondata.com.log
>

As long as Spark is working fine, you can ignore all output from the EC2
script :).


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Have you taken a look
at sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ?

You can find examples there.
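
For reference, the pattern exercised there looks roughly like the sketch below
(Spark 1.x syntax); the JDBC URL, table name and credentials are placeholders to
replace with your own:

sqlContext.sql(
  """CREATE TEMPORARY TABLE people
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (url 'jdbc:h2:mem:testdb0', dbtable 'TEST.PEOPLE',
    |         user 'testUser', password 'testPass')
  """.stripMargin)

sqlContext.sql("SELECT * FROM people").show()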

On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:

> Is creating a table using the SparkSQLContext currently supported?
>
> Regards,
> Raghu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Please take a look
at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala

On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:

> Is creating a table using the SparkSQLContext currently supported?
>
> Regards,
> Raghu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
This requires Hive to be installed and uses HiveContext, right?

What is the SparkSQLContext useful for?

On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:

> Please take a look
> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>
> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>
>> Is creating a table using the SparkSQLContext currently supported?
>>
>> Regards,
>> Raghu
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Btw, Thanks a lot for all your quick responses - it is very useful and
definitely appreciate it :-)

On Mon, Jan 18, 2016 at 1:28 PM, Raghu Ganti  wrote:

> This requires Hive to be installed and uses HiveContext, right?
>
> What is the SparkSQLContext useful for?
>
> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>
>> Please take a look
>> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>
>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>>
>>> Is creating a table using the SparkSQLContext currently supported?
>>>
>>> Regards,
>>> Raghu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark SQL create table

2016-01-18 Thread Ted Yu
By SparkSQLContext, I assume you mean SQLContext.
From the doc for SQLContext#createDataFrame():

   *  dataFrame.registerTempTable("people")
   *  sqlContext.sql("select name from people").collect.foreach(println)

If you want to persist the table externally, you need Hive, etc.
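
A minimal end-to-end sketch of that route (the case class, data and table name
are just illustrative; assumes a plain SQLContext, e.g. in spark-shell):

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(Seq(Person("Alice", 30), Person("Bob", 25)))
df.registerTempTable("people")   // visible only within this SQLContext, not persisted

sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)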

Regards

On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti  wrote:

> This requires Hive to be installed and uses HiveContext, right?
>
> What is the SparkSQLContext useful for?
>
> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>
>> Please take a look
>> at sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>
>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran  wrote:
>>
>>> Is creating a table using the SparkSQLContext currently supported?
>>>
>>> Regards,
>>> Raghu
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey Marco,

Since the code in the Future runs asynchronously, you cannot call
"sparkContext.stop" at the end of "fetch", because the code in the Future may
not have finished yet.

However, the exception seems weird. Do you have a simple reproducer?
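
For illustration only, a minimal sketch (not your original code) of one way to
sequence it: combine the two futures, keep every SparkContext call on the driver,
and block on the combined result before stopping the context. The service calls
and types below are simplified stand-ins:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.{SparkConf, SparkContext}

object FetcherSketch {
  def fetch(name: String, master: String): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName(name).setMaster(master))

    // stand-ins for the two external service calls
    val callOne: Future[Seq[(String, Int)]] = Future(Seq(("a", 1), ("b", 2)))
    val callTwo: Future[Seq[(String, Int)]] = Future(Seq(("a", 10)))

    // the callback below still runs in a driver-side thread, so the parallelize
    // calls happen on the driver; the key point is to wait for the future to
    // complete before calling sc.stop()
    val joinedCount: Future[Long] = callOne.zip(callTwo).map { case (dataOne, dataTwo) =>
      val rddOne = sc.parallelize(dataOne)
      val rddTwo = sc.parallelize(dataTwo)
      rddOne.leftOuterJoin(rddTwo).count()   // an action, so the job actually runs
    }

    println("joined rows: " + Await.result(joinedCount, 10.minutes))
    sc.stop()   // only after the asynchronous work has finished
  }
}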


On Mon, Jan 18, 2016 at 9:13 AM, Ted Yu  wrote:

>   externalCallTwo map { dataTwo =>
> println("in map") // prints, so it gets here ...
> val rddOne = sparkContext.parallelize(dataOne)
>
> I don't think you should call method on sparkContext in map function.
> sparkContext lives on driver side.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 6:27 AM, Marco  wrote:
>
>> Hello,
>>
>> I am using Spark 1.5.1 within SBT, and Scala 2.10.6 and I am facing an
>> issue with the SparkContext.
>>
>> Basically, I have an object that needs to do several things:
>>
>> - call an external service One (web api)
>> - call an external service Two (another api)
>> - read and produce an RDD from HDFS (Spark)
>> - parallelize the data obtained in the first two calls
>> - join these different rdds, do stuff with them...
>>
>> Now, I am trying to do it in an asynchronous way. This doesn't seem to
>> work, though. My guess is that Spark doesn't see the calls to .parallelize,
>> as they are made in different tasks (or Future, therefore this code is
>> called before/later or maybe with an unset Context (can it be?)). I have
>> tried different ways, one of these being the call to SparkEnv.set in the
>> calls to flatMap and map (in the Future). However, all I get is Cannot call
>> methods on a stopped SparkContext. It just doesnt'work - maybe I just
>> misunderstood what it does, therefore I removed it.
>>
>> This is the code I have written so far:
>>
>> object Fetcher {
>>
>>   def fetch(name, master, ...) = {
>> val externalCallOne: Future[WSResponse] = externalService1()
>> val externalCallTwo: Future[String] = externalService2()
>> // val sparkEnv = SparkEnv.get
>> val config = new SparkConf()
>> .setAppName(name)
>> .set("spark.master", master)
>> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>
>> val sparkContext = new SparkContext(config)
>> //val sparkEnv = SparkEnv.get
>>
>> val eventuallyJoinedData = externalCallOne flatMap { dataOne =>
>>   // SparkEnv.set(sparkEnv)
>>   externalCallTwo map { dataTwo =>
>> println("in map") // prints, so it gets here ...
>> val rddOne = sparkContext.parallelize(dataOne)
>> val rddTwo = sparkContext.parallelize(dataTwo)
>> // do stuff here ... foreach/println, and
>>
>> val joinedData = rddOne leftOuterJoin (rddTwo)
>>   }
>> }
>> eventuallyJoinedData onSuccess { case success => ...  }
>> eventuallyJoinedData onFailure { case error =>
>> println(error.getMessage) }
>> // sparkContext.stop
>>   }
>>
>> }
>> As you can see, I have also tried to comment the line to stop the
>> context, but then I get another issue:
>>
>> 13:09:14.929 [ForkJoinPool-1-worker-5] INFO
>>  org.apache.spark.SparkContext - Starting job: count at Fetcher.scala:38
>> 13:09:14.932 [shuffle-server-0] DEBUG io.netty.channel.nio.NioEventLoop -
>> Selector.select() returned prematurely because
>> Thread.currentThread().interrupt() was called. Use
>> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
>> 13:09:14.936 [Spark Context Cleaner] ERROR
>> org.apache.spark.ContextCleaner - Error in cleaning thread
>> java.lang.InterruptedException: null
>> at java.lang.Object.wait(Native Method) ~[na:1.8.0_65]
>> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
>> ~[na:1.8.0_65]
>> at
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:157)
>> ~[spark-core_2.10-1.5.1.jar:1.5.1]
>> at
>> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> at 
>> org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> at
>> org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
>> [spark-core_2.10-1.5.1.jar:1.5.1]
>> 13:09:14.940 [db-async-netty-thread-1] DEBUG
>> io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely
>> because Thread.currentThread().interrupt() was called. Use
>> NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.
>> 13:09:14.943 [SparkListenerBus] ERROR org.apache.spark.util.Utils -
>> uncaught error in thread SparkListenerBus, stopping SparkContext
>> java.lang.InterruptedException: null
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
>> ~[na:1.8.0_65]
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>> ~[na:1.8.0_65]
>> at 

Re: Spark Streaming: Does mapWithState implicitly partition the dsteram?

2016-01-18 Thread Shixiong(Ryan) Zhu
mapWithState uses HashPartitioner by default. You can use
"StateSpec.partitioner" to set your custom partitioner.

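A minimal sketch of where the partitioner is plugged in (Spark 1.6); the state
function, key/value types and partition count below are only illustrative, and
keyedStream is assumed to be an existing DStream[(String, Int)]:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val spec = StateSpec
  .function(updateCount _)
  .partitioner(new HashPartitioner(16))   // or any custom Partitioner

val stateStream = keyedStream.mapWithState(spec)
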
On Sun, Jan 17, 2016 at 11:00 AM, Lin Zhao  wrote:

> When the state is passed to the task that handles a mapWithState for a
> particular key, if the key is distributed, it seems extremely difficult to
> coordinate and synchronise the state. Is there a partition by key before a
> mapWithState? If not what exactly is the execution model?
>
> Thanks,
>
> Lin
>
>


Spark Summit East - Full Schedule Available

2016-01-18 Thread Scott walent
Join the Apache Spark community at the 2nd annual Spark Summit East from
February 16-18, 2016 in New York City.

We will kick things off with a Spark update from Matei Zaharia followed by
over 60 talks that were selected by the program committee. The agenda this
year includes enterprise talks from Microsoft, Bloomberg and Comcast as
well as the popular developer, data science, research and application
tracks.  See the full agenda at https://spark-summit.org/east-2016/schedule.


If you are new to Spark or looking to improve on your knowledge of the
technology, we are offering three levels of Spark Training: Spark
Essentials, Advanced Exploring Wikipedia with Spark, and Data Science with
Spark. Visit https://spark-summit.org/east-2016/schedule/spark-training for
details.

Space is limited and we anticipate selling out, so register now! Use promo
code "ApacheListEast" to save 20% when registering before January 29, 2016.
Register at https://spark-summit.org/register.

We look forward to seeing you there.

Scott and the Summit Organizers


How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi,

I have a scenario wherein my dataset has around 30 columns. It is basically
user activity information. I need to group the information by each user and
then for each column/activity parameter I need to find the percentage
affinity for each value in that column for that user. Below is the sample
input and output.

UserId  C1          C2          C3
1       A           <20         0
1       A           >20 & <40   1
1       B           >20 & <40   0
1       C           >20 & <40   0
1       C           >20 & <40   0
2       A           <20         0
3       B           >20 & <40   1
3       B           >40         2

Output

UserId  C1                 C2                      C3
1       A:0.4|B:0.2|C:0.4  <20:0.2|>20 & <40:0.8   0:0.8|1:0.2
2       A:1                <20:1                   0:1
3       B:1                >20 & <40:0.5|>40:0.5   1:0.5|2:0.5

Presently this is how I am calculating these values:
Group by UserId and C1 and compute values for C1 for all the users, then do
a group by UserId and C2 and find the fractions for C2 for each user, and
so on. This approach is quite slow. Also, the number of records for each
user will be at max 30. So I would like to take a second approach wherein I
do a groupByKey and pass the entire list of records for each key to a
function which computes all the percentages for each column for each user
at once. Below are the steps I am trying to follow:

1. Dataframe1 => group by UserId , find the counts of records for each
user. Join the results back to the input so that counts are available with
each record
2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))

def myUserAggregator(rows: Iterable[Row]):
scala.collection.mutable.Map[Int,String] = {
val returnValue = scala.collection.mutable.Map[Int,String]()
if (rows != null) {
  val activityMap = scala.collection.mutable.Map[Int,
scala.collection.mutable.Map[String,
Int]]().withDefaultValue(scala.collection.mutable.Map[String,
Int]().withDefaultValue(0))

  val rowIt = rows.iterator
  var sentCount = 1
  for (row <- rowIt) {
sentCount = row(29).toString().toInt
for (i <- 0 until row.length) {
  var m = activityMap(i)
  if (activityMap(i) == null) {
m = collection.mutable.Map[String,
Int]().withDefaultValue(0)
  }
  m(row(i).toString()) += 1
  activityMap.update(i, m)
}
  }
  var activityPPRow: Row = Row()
  for((k,v) <- activityMap) {
  var rowVal:String = ""
  for((a,b) <- v) {
rowVal += rowVal + a + ":" + b/sentCount + "|"
  }
  returnValue.update(k, rowVal)
//  activityPPRow.apply(k) = rowVal
  }

}
return returnValue
  }

When I run step 2 I get the following error. I am new to Scala and Spark
and am unable to figure out how to pass the Iterable[Row] to a function and
get back the results.

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
..


Thanks for the help.

Regards,
Neha Mehta


Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi,

I am running my app on a single machine first before moving it to the
cluster; actually, simultaneous actions are not working for me now. Is this
coming from the fact that I am using a single machine? Yet I am using the
FAIR scheduler.

2016-01-17 21:23 GMT+01:00 Mark Hamstra :

> It can be far more than that (e.g.
> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
> either unrecognized or a greatly under-appreciated and underused feature of
> Spark.
>
> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers  wrote:
>
>> the re-use of shuffle files is always a nice surprise to me
>>
>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
>> wrote:
>>
>>> Same SparkContext means same pool of Workers.  It's up to the Scheduler,
>>> not the SparkContext, whether the exact same Workers or Executors will be
>>> used to calculate simultaneous actions against the same RDD.  It is likely
>>> that many of the same Workers and Executors will be used as the Scheduler
>>> tries to preserve data locality, but that is not guaranteed.  In fact, what
>>> is most likely to happen is that the shared Stages and Tasks being
>>> calculated for the simultaneous actions will not actually be run at exactly
>>> the same time, which means that shuffle files produced for one action will
>>> be reused by the other(s), and repeated calculations will be avoided even
>>> without explicitly caching/persisting the RDD.
>>>
>>> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
>>> wrote:
>>>
 Same rdd means same sparkcontext means same workers

 Cache/persist the rdd to avoid repeated jobs
 On Jan 17, 2016 5:21 AM, "Mennour Rostom"  wrote:

> Hi,
>
> Thank you all for your answers,
>
> If I correctly understand, actions (in my case foreach) can be run
> concurrently and simultaneously on the SAME rdd, (which is logical because
> they are read only object). however, I want to know if the same workers 
> are
> used for the concurrent analysis ?
>
> Thank you
>
> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>
>> I stand corrected. How considerable are the benefits though? Will the
>> scheduler be able to dispatch jobs from both actions simultaneously (or 
>> on
>> a when-workers-become-available basis)?
>>
>> On 15 January 2016 at 11:44, Koert Kuipers  wrote:
>>
>>> we run multiple actions on the same (cached) rdd all the time, i
>>> guess in different threads indeed (its in akka)
>>>
>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 RDDs actually are thread-safe, and quite a few applications use
 them this way, e.g. the JDBC server.

 Matei

 On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
 wrote:

 I don't think RDDs are threadsafe.
 More fundamentally however, why would you want to run RDD actions
 in parallel? The idea behind RDDs is to provide you with an 
 abstraction for
 computing parallel operations on distributed data. Even if you were to 
 call
 actions from several threads at once, the individual executors of your
 spark environment would still have to perform operations sequentially.

 As an alternative, I would suggest to restructure your RDD
 transformations to compute the required results in one single 
 operation.

 On 15 January 2016 at 06:18, Jonathan Coveney 
 wrote:

> Threads
>
>
> On Friday, January 15, 2016, Kira 
> wrote:
>
>> Hi,
>>
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how
>> can this
>> be done ?
>>
>> Thank you,
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com .
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


>>>
>>
>
>>>
>>
>


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I thought the script tries to install Hadoop/HDFS as well, and it looks like it
failed. The installation is only standalone Spark without Hadoop. Is this the
correct behaviour?
Also errors in the log:
   ERROR: Unknown Tachyon version
   Error: Could not find or load main class crayondata.com.log

Thanks
Oleg.


Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
stacktrace? details?

On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom  wrote:

> Hi,
>
> I am running my app in a single machine first before moving it in the
> cluster; actually simultaneous actions are not working for me now; is this
> comming from the fact that I am using a single machine ? yet I am using
> FAIR scheduler.
>
> 2016-01-17 21:23 GMT+01:00 Mark Hamstra :
>
>> It can be far more than that (e.g.
>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>> either unrecognized or a greatly under-appreciated and underused feature of
>> Spark.
>>
>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers 
>> wrote:
>>
>>> the re-use of shuffle files is always a nice surprise to me
>>>
>>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
>>> wrote:
>>>
 Same SparkContext means same pool of Workers.  It's up to the
 Scheduler, not the SparkContext, whether the exact same Workers or
 Executors will be used to calculate simultaneous actions against the same
 RDD.  It is likely that many of the same Workers and Executors will be used
 as the Scheduler tries to preserve data locality, but that is not
 guaranteed.  In fact, what is most likely to happen is that the shared
 Stages and Tasks being calculated for the simultaneous actions will not
 actually be run at exactly the same time, which means that shuffle files
 produced for one action will be reused by the other(s), and repeated
 calculations will be avoided even without explicitly caching/persisting the
 RDD.

 On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
 wrote:

> Same rdd means same sparkcontext means same workers
>
> Cache/persist the rdd to avoid repeated jobs
> On Jan 17, 2016 5:21 AM, "Mennour Rostom"  wrote:
>
>> Hi,
>>
>> Thank you all for your answers,
>>
>> If I correctly understand, actions (in my case foreach) can be run
>> concurrently and simultaneously on the SAME rdd, (which is logical 
>> because
>> they are read only object). however, I want to know if the same workers 
>> are
>> used for the concurrent analysis ?
>>
>> Thank you
>>
>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>
>>> I stand corrected. How considerable are the benefits though? Will
>>> the scheduler be able to dispatch jobs from both actions simultaneously 
>>> (or
>>> on a when-workers-become-available basis)?
>>>
>>> On 15 January 2016 at 11:44, Koert Kuipers 
>>> wrote:
>>>
 we run multiple actions on the same (cached) rdd all the time, i
 guess in different threads indeed (its in akka)

 On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
 matei.zaha...@gmail.com> wrote:

> RDDs actually are thread-safe, and quite a few applications use
> them this way, e.g. the JDBC server.
>
> Matei
>
> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
> wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions
> in parallel? The idea behind RDDs is to provide you with an 
> abstraction for
> computing parallel operations on distributed data. Even if you were 
> to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD
> transformations to compute the required results in one single 
> operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney 
> wrote:
>
>> Threads
>>
>>
>> On Friday, January 15, 2016, Kira 
>> wrote:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes
>>> how can this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com .
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>

>>>
>>

>>>
>>
>


spark random forest regressor : argument minInstancesPerNode not accepted

2016-01-18 Thread Christopher Bourez
Dears,

I'm trying to set the parameter 'minInstancesPerNode', but it does not seem to
be accepted:

model = RandomForest.trainRegressor(trainingData,
categoricalFeaturesInfo={},numTrees=2500,
featureSubsetStrategy="sqrt",impurity='variance',minInstancesPerNode=1000)

Traceback (most recent call last):

  File "", line 1, in 

TypeError: trainRegressor() got an unexpected keyword argument
'minInstancesPerNode'

Thanks for your help

Christopher Bourez
06 17 17 50 60


Re: Serializing DataSets

2016-01-18 Thread Simon Hafner
And for deserializing,
`sqlContext.read.parquet("path/to/parquet").as[T]` and catch the
error?

2016-01-14 3:43 GMT+08:00 Michael Armbrust :
> Yeah, thats the best way for now (note the conversion is purely logical so
> there is no cost of calling toDF()).  We'll likely be combining the classes
> in Spark 2.0 to remove this awkwardness.
>
> On Tue, Jan 12, 2016 at 11:20 PM, Simon Hafner 
> wrote:
>>
>> What's the proper way to write DataSets to disk? Convert them to a
>> DataFrame and use the writers there?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey, did you mean that the scheduling delay timeline is incorrect because
it's too short and some values are missing? A batch won't have a scheduling
delay until it starts to run. In your example, a lot of batches are still
waiting, so they don't have a scheduling delay yet.

On Sun, Jan 17, 2016 at 4:49 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm trying to understand how Scheduling Delays are displayed in
> Streaming page in web UI and think the values are displayed
> incorrectly in the Timelines column. I'm only concerned with the
> scheduling delays (on y axis) per batch times (x axis). It appears
> that the values (on y axis) are correct, but not how they are
> displayed per batch times.
>
> See the second screenshot in
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming-webui.html#scheduling-delay
> .
>
> Can anyone explain how the delays for batches per batch time should be
> read? I'm specifically asking about the timeline (not histogram as it
> seems fine).
>
> Pozdrawiam,
> Jacek
>
> Jacek Laskowski | https://medium.com/@jaceklaskowski/
> Mastering Apache Spark
> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
Look at to_utc_timestamp and from_utc_timestamp.
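
For example, a sketch with the DataFrame functions (Spark 1.5+), assuming the
session timezone is UTC; the column name, value and target zone are just examples:

import org.apache.spark.sql.functions.{from_unixtime, from_utc_timestamp}

// a one-column example frame; "epoch" holds Unix seconds
val df = sqlContext.createDataFrame(Seq(Tuple1(1453155600L))).toDF("epoch")

val withTs = df.withColumn(
  "ts_new_york",
  from_utc_timestamp(from_unixtime(df("epoch")), "America/New_York"))
withTs.show()
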
On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:

> Hi spark users and developers,
>
> what do you do if you want the from_unixtime function in spark sql to
> return the timezone you want instead of the system timezone?
>
> Best Regards,
>
> Jerry
>


PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread efwalkermit
Should I be able to broadcast a fairly simple user-defined class?  I'm having
no success in 1.6.0 (or 1.5.2):

$ cat test_spark.py
import pyspark


class SimpleClass:
def __init__(self):
self.val = 5
def get(self):
return self.val


def main():
sc = pyspark.SparkContext()
b = sc.broadcast(SimpleClass())
results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
b.value.get()).collect() 

if __name__ == '__main__':
main()


$ spark-submit --master local[1] test_spark.py
[snip]
  File "/Users/ed/src/mrspark/examples/fortyler/test_spark.py", line 14, in

results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
b.value.get()).collect()
  File
"/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
line 97, in value
self._value = self.load(self._path)
  File
"/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
line 88, in load
return pickle.load(f)
AttributeError: 'module' object has no attribute 'SimpleClass'
[snip]



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Broadcast-of-User-Defined-Class-No-Work-tp26000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: PySpark Broadcast of User Defined Class No Work?

2016-01-18 Thread Maciej Szymkiewicz
Python can pickle only objects, not classes. It means that SimpleClass
has to be importable on every worker node to enable correct
deserialization. Typically that means keeping class definitions in a
separate module and distributing them using, for example, --py-files.


On 01/19/2016 12:34 AM, efwalkermit wrote:
> Should I be able to broadcast a fairly simple user-defined class?  I'm having
> no success in 1.6.0 (or 1.5.2):
>
> $ cat test_spark.py
> import pyspark
>
>
> class SimpleClass:
> def __init__(self):
> self.val = 5
> def get(self):
> return self.val
>
>
> def main():
> sc = pyspark.SparkContext()
> b = sc.broadcast(SimpleClass())
> results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
> b.value.get()).collect() 
>
> if __name__ == '__main__':
> main()
>
>
> $ spark-submit --master local[1] test_spark.py
> [snip]
>   File "/Users/ed/src/mrspark/examples/fortyler/test_spark.py", line 14, in
> 
> results = sc.parallelize([ x for x in range(10) ]).map(lambda x: x +
> b.value.get()).collect()
>   File
> "/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
> line 97, in value
> self._value = self.load(self._path)
>   File
> "/Users/ed/.spark/spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
> line 88, in load
> return pickle.load(f)
> AttributeError: 'module' object has no attribute 'SimpleClass'
> [snip]
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Broadcast-of-User-Defined-Class-No-Work-tp26000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>






Re: trouble using eclipse to view spark source code

2016-01-18 Thread Jakob Odersky
Have you followed the guide on how to import spark into eclipse
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
?

On 18 January 2016 at 13:04, Andy Davidson 
wrote:

> Hi
>
> My project is implemented using Java 8 and Python. Some times its handy to
> look at the spark source code. For unknown reason if I open a spark project
> my java projects show tons of compiler errors. I think it may have
> something to do with Scala. If I close the projects my java code is fine.
>
> I typically I only want to import the machine learning and streaming
> projects.
>
> I am not sure if this is an issue or not but my java projects are built
> using gradel
>
> In eclipse preferences -> scala -> installations I selected Scala: 2.10.6
> (built in)
>
> Any suggestions would be greatly appreciate
>
> Andy
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark + Sentry + Kerberos don't add up?

2016-01-18 Thread Ruslan Dautkhanov
Hi Romain,

Thank you for your response.

Adding Kerberos support might be as simple as
https://issues.cloudera.org/browse/LIVY-44 ? I.e. add Livy --principal and
--keytab parameters to be passed to spark-submit.

As a workaround I just did kinit (using Hue's keytab) and then launched the
Livy Server. It will probably work as long as the Kerberos ticket doesn't
expire. That's why it would be great to have support for the --principal and
--keytab parameters for spark-submit, as explained in
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html


The only problem I have currently is the above error stack in my previous
email:

The Spark session could not be created in the cluster:
> at org.apache.hadoop.security.*UserGroupInformation.doAs*(
> UserGroupInformation.java:1671)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
> SparkSubmit.scala:160)



>> AFAIK Hive impersonation should be turned off when using Sentry

Yep, exactly. That's what I did. It is disabled now. But it looks like, on the
other hand, Spark or the Spark Notebook wants to have that enabled?
It tries to do org.apache.hadoop.security.UserGroupInformation.doAs(), hence
the error.

So Sentry isn't compatible with Spark on kerberized clusters? Is there any
workaround for this problem?


-- 
Ruslan Dautkhanov

On Mon, Jan 18, 2016 at 3:52 PM, Romain Rigaux  wrote:

> Livy does not support any Kerberos yet
> https://issues.cloudera.org/browse/LIVY-3
>
> Are you focusing instead about HS2 + Kerberos with Sentry?
>
> AFAIK Hive impersonation should be turned off when using Sentry:
> http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/sg_sentry_service_config.html
>
> On Sun, Jan 17, 2016 at 10:04 PM, Ruslan Dautkhanov 
> wrote:
>
>> Getting following error stack
>>
>> The Spark session could not be created in the cluster:
>>> at org.apache.hadoop.security.*UserGroupInformation.doAs*
>>> (UserGroupInformation.java:1671)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) )
>>> at org.*apache.hadoop.hive.metastore.HiveMetaStoreClient*
>>> .open(HiveMetaStoreClient.java:466)
>>> at
>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:234)
>>> at
>>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>>> ... 35 more
>>
>>
>> My understanding that hive.server2.enable.impersonation and
>> hive.server2.enable.doAs should be enabled to make
>> UserGroupInformation.doAs() work?
>>
>> When I try to enable these parameters, Cloudera Manager shows error
>>
>> Hive Impersonation is enabled for Hive Server2 role 'HiveServer2
>>> (hostname)'.
>>> Hive Impersonation should be disabled to enable Hive authorization using
>>> Sentry
>>
>>
>> So Spark-Hive conflicts with Sentry!?
>>
>> Environment: Hue 3.9 Spark Notebooks + Livy Server (built from master).
>> CDH 5.5.
>>
>> This is a kerberized cluster with Sentry.
>>
>> I was using hue's keytab as hue user is normally (by default in CDH) is
>> allowed to impersonate to other users.
>> So very convenient for Spark Notebooks.
>>
>> Any information to help solve this will be highly appreciated.
>>
>>
>> --
>> Ruslan Dautkhanov
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Hue-Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to hue-user+unsubscr...@cloudera.org.
>>
>
>


Re: trouble using eclipse to view spark source code

2016-01-18 Thread Annabel Melongo
Andy,
This has nothing to do with Spark, but I guess you don't have the proper Scala
version. The version you're currently running doesn't recognize a method in
Scala ArrayOps, namely: scala.collection.mutable.ArrayOps.$colon$plus

On Monday, January 18, 2016 7:53 PM, Andy Davidson 
 wrote:
 

 Many thanks. I was using a different scala plug in. this one seems to work 
better I no longer get compile error how ever I get the following stack trace 
when I try to run my unit tests with mllib open
I am still using eclipse luna.
Andy
java.lang.NoSuchMethodError: 
scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/reflect/ClassTag;)Ljava/lang/Object;
 at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73) at 
org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76) at 
org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64) at 
com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52) at 
com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
 at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) 
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
 at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

From:  Jakob Odersky 
Date:  Monday, January 18, 2016 at 3:20 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble using eclipse to view spark source code


Have you followed the guide on how to import spark into eclipse 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
 ?

On 18 January 2016 at 13:04, Andy Davidson  
wrote:

Hi 
My project is implemented using Java 8 and Python. Some times its handy to look 
at the spark source code. For unknown reason if I open a spark project my java 
projects show tons of compiler errors. I think it may have something to do with 
Scala. If I close the projects my java code is fine.
I typically I only want to import the machine learning and streaming projects.
I am not sure if this is an issue or not but my java projects are built using 
gradel
In eclipse preferences -> scala -> installations I selected Scala: 2.10.6 
(built in)
Any suggestions would be greatly appreciate
Andy






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

  
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Many thanks. I was using a different Scala plug-in. This one seems to work
better; I no longer get compile errors. However, I get the following stack
trace when I try to run my unit tests with the mllib project open.

I am still using eclipse luna.

Andy

java.lang.NoSuchMethodError:
scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/refle
ct/ClassTag;)Ljava/lang/Object;
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73)
at org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76)
at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64)
at com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52)
at com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.
java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.j
ava:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.ja
va:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.jav
a:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26
)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestRef
erence.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:3
8)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:459)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:675)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.
java:382)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner
.java:192)


From:  Jakob Odersky 
Date:  Monday, January 18, 2016 at 3:20 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble using eclipse to view spark source code

> Have you followed the guide on how to import spark into eclipse
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#Usefu
> lDeveloperTools-Eclipse ?
> 
> On 18 January 2016 at 13:04, Andy Davidson 
> wrote:
>> Hi 
>> 
>> My project is implemented using Java 8 and Python. Some times its handy to
>> look at the spark source code. For unknown reason if I open a spark project
>> my java projects show tons of compiler errors. I think it may have something
>> to do with Scala. If I close the projects my java code is fine.
>> 
>> I typically I only want to import the machine learning and streaming
>> projects.
>> 
>> I am not sure if this is an issue or not but my java projects are built using
>> gradel
>> 
>> In eclipse preferences -> scala -> installations I selected Scala: 2.10.6
>> (built in)
>> 
>> Any suggestions would be greatly appreciate
>> 
>> Andy
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
It looks like Spark is not working fine:

I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html )
and I see spot instances installed on EC2.

From the spark shell I am counting lines and got a connection exception.
*scala> val lines = sc.textFile("README.md")*
*scala> lines.count()*



*scala> val lines = sc.textFile("README.md")*

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from
textFile at :21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile
at :21

*scala> lines.count()*

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
4 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
5 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
6 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
7 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
8 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
9 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
java.lang.RuntimeException: java.net.ConnectException: Call to
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000 failed on
connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:567)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD.count(RDD.scala:1143)
at 

rdd.foreach return value

2016-01-18 Thread charles li
code snippet

the 'print' actually prints info on the worker node, but I feel confused about
where the 'return' value goes, for I get nothing on the driver node.
-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread David Russell
The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org


code snippet




the 'print' actually print info on the worker node, but I feel confused where 
the 'return' value

goes to. for I get nothing on the driver node.
--


--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao

foreach.png
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
Here is the signature for foreach:
 def foreach(f: T => Unit): Unit = withScope {

I don't think you can return an element in the way shown in the snippet.
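
A small sketch of the difference; to get values back on the driver, use a
transformation plus an action (collect here, which assumes the result is small)
rather than returning from foreach:

val rdd = sc.parallelize(1 to 10)

// prints on the executors; nothing comes back to the driver
rdd.foreach(x => println(x))

// brings results back to the driver
val doubled: Array[Int] = rdd.map(_ * 2).collect()
doubled.foreach(println)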

On Mon, Jan 18, 2016 at 7:34 PM, charles li  wrote:

> code snippet
>
>
> ​
> the 'print' actually print info on the worker node, but I feel confused
> where the 'return' value
> goes to. for I get nothing on the driver node.
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


How to call a custom function from GroupByKey which takes Iterable[Row] as input and returns a Map[Int,String] as output in scala

2016-01-18 Thread Neha Mehta
Hi,

I have a scenario wherein my dataset has around 30 columns. It is basically
user activity information. I need to group the information by each user and
then for each column/activity parameter I need to find the percentage
affinity for each value in that column for that user. Below is the sample
input and output.

UserId C1 C2 C3
1 A <20 0
1 A >20 & <40 1
1 B >20 & <40 0
1 C >20 & <40 0
1 C >20 & <40 0
2 A <20 0
3 B >20 & <40 1
3 B >40 2








Output


1 A:0.4|B:0.2|C:0.4 <20:02|>20 & <40:0.8 0:0.8|1:0.2
2 A:1 <20:1 0:01
3 B:1 >20 & <40:0.5|>40:0.5 1:0.5|2:0.5

Presently this is how I am calculating these values:
Group by UserId and C1 and compute values for C1 for all the users, then do
a group by by Userid and C2 and find the fractions for C2 for each user and
so on. This approach is quite slow.  Also the number of records for each
user will be at max 30. So I would like to take a second approach wherein I
do a groupByKey and pass the entire list of records for each key to a
function which computes all the percentages for each column for each user
at once. Below are the steps I am trying to follow:

1. Dataframe1 => group by UserId , find the counts of records for each
user. Join the results back to the input so that counts are available with
each record
2. Dataframe1.map(s=>s(1),s).groupByKey().map(s=>myUserAggregator(s._2))

  def myUserAggregator(rows: Iterable[Row]): scala.collection.mutable.Map[Int, String] = {
    val returnValue = scala.collection.mutable.Map[Int, String]()
    if (rows != null) {
      val activityMap = scala.collection.mutable.Map[Int, scala.collection.mutable.Map[String, Int]]()
        .withDefaultValue(scala.collection.mutable.Map[String, Int]().withDefaultValue(0))

      val rowIt = rows.iterator
      var sentCount = 1
      for (row <- rowIt) {
        sentCount = row(29).toString().toInt
        for (i <- 0 until row.length) {
          var m = activityMap(i)
          if (activityMap(i) == null) {
            m = collection.mutable.Map[String, Int]().withDefaultValue(0)
          }
          m(row(i).toString()) += 1
          activityMap.update(i, m)
        }
      }
      var activityPPRow: Row = Row()
      for ((k, v) <- activityMap) {
        var rowVal: String = ""
        for ((a, b) <- v) {
          rowVal += rowVal + a + ":" + b / sentCount + "|"
        }
        returnValue.update(k, rowVal)
        // activityPPRow.apply(k) = rowVal
      }
    }
    return returnValue
  }

When I run step 2 I get the following error. I am new to Scala and Spark
and am unable to figure out how to pass the Iterable[Row] to a function and
get back the results.

org.apache.spark.SparkException: Task not serializable
at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:97)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:102)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:104)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:106)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:108)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:110)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:112)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:114)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:116)
..


Thanks for the help.

Regards,
Neha Mehta
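
For reference, the "Task not serializable" error usually means the function passed to
map/groupByKey captures its enclosing object (for example a class or REPL object that
also holds the SparkContext), which is not serializable. One way to avoid that is to put
the aggregation logic in a standalone serializable object. A minimal sketch only, not the
original code; column positions, names, and the exact percentage format are illustrative:

import org.apache.spark.sql.Row

object UserAffinity extends Serializable {
  // For every column index, the fraction of this user's rows holding each distinct value,
  // rendered as "value:fraction|value:fraction|..."
  def aggregate(rows: Iterable[Row]): Map[Int, String] = {
    val recs = rows.toSeq
    if (recs.isEmpty) Map.empty[Int, String]
    else {
      val total = recs.size.toDouble
      (0 until recs.head.length).map { i =>
        val fractions = recs
          .groupBy(r => r(i).toString)
          .map { case (value, rs) => value + ":" + (rs.size / total) }
        i -> fractions.mkString("|")
      }.toMap
    }
  }
}

// usage (assuming the user id is in column 0):
// val result = dataframe1.rdd
//   .map(r => (r(0).toString, r))
//   .groupByKey()
//   .mapValues(UserAffinity.aggregate)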


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Peter Zhang
Could you run spark-shell from the $SPARK_HOME directory?

You can try running your command from $SPARK_HOME, or point to README.md
with a full path.
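
For example, something like the following; the local path is only an illustration. The
point is that a bare relative path such as "README.md" is resolved against the cluster's
default filesystem (HDFS on port 9000 for the EC2 scripts), which is why the shell keeps
trying to connect there:

// read a local file explicitly (it must exist at this path on every node),
// or upload the file to HDFS first and use an hdfs:// URI instead
val lines = sc.textFile("file:///root/spark/README.md")
lines.count()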


Peter Zhang
-- 
Google
Sent with Airmail

On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com) wrote:

It looks spark is not working fine : 
 
I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html. ) 
and I see spot instances installed on EC2.

from spark shell I am counting lines and got connection exception.
scala> val lines = sc.textFile("README.md")
scala> lines.count()



scala> val lines = sc.textFile("README.md")

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from textFile at 
:21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
:21

scala> lines.count()

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 0 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 1 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 2 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 3 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 4 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 5 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 6 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 7 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 8 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried 9 
time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
sleepTime=1 SECONDS)
java.lang.RuntimeException: java.net.ConnectException: Call to 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000 failed on 
connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:567)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:318)
at 
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:291)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at 
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 

Spark Streaming - Latest batch-time can't keep up with current time

2016-01-18 Thread Collin Shi
Hi all, 

After submitting the job, the latest batch-time is at first almost the same as the current
time. Let's say, if the current time is '12:00:00', then the latest batch-time would be
'11:59:59'. But as time goes on, the difference gets greater and greater. For instance,
the current time is '13:00:00' but the latest batch-time is '12:40:00'.

The total delay is sometimes greater than the batch interval, but that shouldn't have an
influence on the batch time, right?

How could I fix this problem?

I'm using Spark 1.4.0 and Kafka.



Thanks,
Collin
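
One way to see where the time goes is to attach a StreamingListener and compare each
batch's processing time against the batch interval; if processing regularly takes longer
than the interval, batches queue up and the latest batch time falls further and further
behind the clock. A minimal sketch (it assumes your StreamingContext is called ssc):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// print scheduling delay and processing time for every completed batch
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println("batch " + info.batchTime +
      ": scheduling delay = " + info.schedulingDelay.getOrElse(-1L) + " ms" +
      ", processing time = " + info.processingDelay.getOrElse(-1L) + " ms")
  }
})

If processing time is consistently above the batch interval, the usual options are a
larger batch interval, more cores/partitions, or less work per batch.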


SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Hi all,

http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
From Hive tables
You can also create SparkR DataFrames from Hive tables. To do this we will need 
to create a HiveContext which can access tables in the Hive MetaStore. Note 
that Spark should have been built with Hive support and more details on the 
difference between SQLContext and HiveContext can be found in the SQL 
programming guide.

# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)

sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

# Queries can be expressed in HiveQL.
results <- sql(hiveContext, "FROM src SELECT key, value")

# results is now a DataFrame
head(results)
##  key   value
## 1 238 val_238
## 2  86  val_86
## 3 311 val_311

I use RStudio to run the above commands. When I run "sql(hiveContext, "CREATE TABLE
IF NOT EXISTS src (key INT, value STRING)")", I get this exception:

16/01/19 12:11:51 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/src
16/01/19 12:11:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one)

How do I use HDFS instead of the local file system (file:)?
Which parameter should be set?

Thanks a lot.


Peter Zhang
-- 
Google
Sent with Airmail

Re: SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Thanks, 

I will try.

Peter

-- 
Google
Sent with Airmail

On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote:

Please make sure you export environment variable HADOOP_CONF_DIR which contains 
the core-site.xml

On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang  wrote:
Hi all,

http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
From Hive tables
You can also create SparkR DataFrames from Hive tables. To do this we will need 
to create a HiveContext which can access tables in the Hive MetaStore. Note 
that Spark should have been built with Hive support and more details on the 
difference between SQLContext and HiveContext can be found in the SQL 
programming guide.


# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)

sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' 
INTO TABLE src")

# Queries can be expressed in HiveQL.
results <- sql(hiveContext, "FROM src SELECT key, value")

# results is now a DataFrame
head(results)
##  key   value
## 1 238 val_238
## 2  86  val_86
## 3 311 val_311

I use RStudio to run above command, when I run "sql(hiveContext,  
"CREATE TABLE IF NOT EXISTS src (key INT, value
STRING)”)”

I got
exception: 16/01/19 12:11:51 INFO FileUtils: Creating directory if it doesn't 
exist: file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR DDLTask: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:file:/user/hive/warehouse/src is not a directory or 
unable to create one)

How  to use HDFS instead of local file system(file)?
Which parameter should to set?

Thanks a lot.


Peter Zhang
-- 
Google
Sent with Airmail



--
Best Regards

Jeff Zhang

Re: rdd.foreach return value

2016-01-18 Thread charles li
hi, great thanks to david and ted, I know that the content of RDD can be
returned to driver using 'collect' method.

but my question is:


1. cause we can write any code we like in the function put into 'foreach',
so what happened when we actually write a 'return' sentence in the foreach
function?
2. as the photo shows bellow, the content of RDD doesn't change after
foreach function, why?
3. I feel a little confused about the 'foreach' method, it should be an
'action', right? cause it return nothing. or is there any best practice of
the 'foreach' funtion? or can some one put your code snippet when using
'foreach' method in your application, that would be awesome.


great thanks again



​

On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:

> Here is signature for foreach:
>  def foreach(f: T => Unit): Unit = withScope {
>
> I don't think you can return element in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
> wrote:
>
>> code snippet
>>
>>
>> ​
>> the 'print' actually print info on the worker node, but I feel confused
>> where the 'return' value
>> goes to. for I get nothing on the driver node.
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread Ted Yu
For #2, RDD is immutable. 

> On Jan 18, 2016, at 8:10 PM, charles li  wrote:
> 
> 
> hi, great thanks to david and ted, I know that the content of RDD can be 
> returned to driver using 'collect' method.
> 
> but my question is:
> 
> 
> 1. cause we can write any code we like in the function put into 'foreach', so 
> what happened when we actually write a 'return' sentence in the foreach 
> function?
> 2. as the photo shows bellow, the content of RDD doesn't change after foreach 
> function, why?
> 3. I feel a little confused about the 'foreach' method, it should be an 
> 'action', right? cause it return nothing. or is there any best practice of 
> the 'foreach' funtion? or can some one put your code snippet when using 
> 'foreach' method in your application, that would be awesome. 
> 
> 
> great thanks again
> 
> 
> 
> ​
> 
>> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>> Here is signature for foreach:
>>  def foreach(f: T => Unit): Unit = withScope {
>> 
>> I don't think you can return element in the way shown in the snippet.
>> 
>>> On Mon, Jan 18, 2016 at 7:34 PM, charles li  wrote:
>>> code snippet
>>> 
>>> <屏幕快照 2016-01-19 上午11.32.05.png>
>>> ​
>>> the 'print' actually print info on the worker node, but I feel confused 
>>> where the 'return' value 
>>> goes to. for I get nothing on the driver node.
>>> -- 
>>> --
>>> a spark lover, a quant, a developer and a good man.
>>> 
>>> http://github.com/litaotao
> 
> 
> 
> -- 
> --
> a spark lover, a quant, a developer and a good man.
> 
> http://github.com/litaotao


Re: rdd.foreach return value

2016-01-18 Thread Vishal Maru
1. foreach doesn't expect any value from the function being passed (your
func_foreach), so nothing happens. The return values are just lost; it's
like calling a function without saving its return value to another var.
foreach also doesn't return anything, so you don't get a modified RDD (unlike
map*).
2. RDDs are immutable. All transform functions (map*, groupBy*, reduceBy*,
etc.) return a new RDD.
3. Yes, it is an action. It just iterates through the elements and calls the function
being passed. That's it. It doesn't collect the values and doesn't return any new
modified RDD.
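
To get data back to the driver you would normally use an action that returns values
(collect, reduce, count, ...) or an accumulator, rather than a return inside foreach.
A minimal sketch (variable names are illustrative):

val rdd = sc.parallelize(1 to 5)

// foreach runs the function for its side effects only; its return value is discarded,
// and the println output appears in the executor logs, not on the driver
rdd.foreach(x => println(x))

// bring values back to the driver with an action that returns data ...
val doubled = rdd.map(_ * 2).collect()   // Array(2, 4, 6, 8, 10) on the driver

// ... or aggregate from inside foreach with an accumulator
val sum = sc.accumulator(0)
rdd.foreach(x => sum += x)
println(sum.value)                       // 15 on the driver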


On Mon, Jan 18, 2016 at 11:10 PM, charles li 
wrote:

>
> hi, great thanks to david and ted, I know that the content of RDD can be
> returned to driver using 'collect' method.
>
> but my question is:
>
>
> 1. cause we can write any code we like in the function put into 'foreach',
> so what happened when we actually write a 'return' sentence in the foreach
> function?
> 2. as the photo shows bellow, the content of RDD doesn't change after
> foreach function, why?
> 3. I feel a little confused about the 'foreach' method, it should be an
> 'action', right? cause it return nothing. or is there any best practice of
> the 'foreach' funtion? or can some one put your code snippet when using
> 'foreach' method in your application, that would be awesome.
>
>
> great thanks again
>
>
>
> ​
>
> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>
>> Here is signature for foreach:
>>  def foreach(f: T => Unit): Unit = withScope {
>>
>> I don't think you can return element in the way shown in the snippet.
>>
>> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
>> wrote:
>>
>>> code snippet
>>>
>>>
>>> ​
>>> the 'print' actually print info on the worker node, but I feel confused
>>> where the 'return' value
>>> goes to. for I get nothing on the driver node.
>>> --
>>> *--*
>>> a spark lover, a quant, a developer and a good man.
>>>
>>> http://github.com/litaotao
>>>
>>
>>
>
>
> --
> *--*
> a spark lover, a quant, a developer and a good man.
>
> http://github.com/litaotao
>


Re: rdd.foreach return value

2016-01-18 Thread charles li
got it, great thanks, Vishal, Ted and David

On Tue, Jan 19, 2016 at 1:10 PM, Vishal Maru  wrote:

> 1. foreach doesn't expect any value from function being passed (in your
> func_foreach). so nothing happens. The return values are just lost. it's
> like calling a function without saving return value to another var.
> foreach also doesn't return anything so you don't get modified RDD (like
> map*).
> 2. RDD's are immutable. All transform functions (map*,groupBy*,reduceBy
> etc.) return new RDD.
> 3. Yes. It's just iterates through elements and calls the function being
> passed. That's it. It doesn't collect the values and don't return any new
> modified RDD.
>
>
> On Mon, Jan 18, 2016 at 11:10 PM, charles li 
> wrote:
>
>>
>> hi, great thanks to david and ted, I know that the content of RDD can be
>> returned to driver using 'collect' method.
>>
>> but my question is:
>>
>>
>> 1. cause we can write any code we like in the function put into
>> 'foreach', so what happened when we actually write a 'return' sentence in
>> the foreach function?
>> 2. as the photo shows bellow, the content of RDD doesn't change after
>> foreach function, why?
>> 3. I feel a little confused about the 'foreach' method, it should be an
>> 'action', right? cause it return nothing. or is there any best practice of
>> the 'foreach' funtion? or can some one put your code snippet when using
>> 'foreach' method in your application, that would be awesome.
>>
>>
>> great thanks again
>>
>>
>>
>> ​
>>
>> On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:
>>
>>> Here is signature for foreach:
>>>  def foreach(f: T => Unit): Unit = withScope {
>>>
>>> I don't think you can return element in the way shown in the snippet.
>>>
>>> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
>>> wrote:
>>>
 code snippet


 ​
 the 'print' actually print info on the worker node, but I feel confused
 where the 'return' value
 goes to. for I get nothing on the driver node.
 --
 *--*
 a spark lover, a quant, a developer and a good man.

 http://github.com/litaotao

>>>
>>>
>>
>>
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Oleg Ruchovets
I am running from $SPARK_HOME.
It looks like a connection problem to port 9000. It is on the master machine.
What is this process that Spark tries to connect to?
Should I start any framework or processes before executing Spark?

Thanks
OIeg.


16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
3 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
4 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
5 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
6 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
7 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
8 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server:
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already tried
9 time(s); retry

On Tue, Jan 19, 2016 at 1:13 PM, Peter Zhang  wrote:

> Could you run spark-shell at $SPARK_HOME DIR?
>
> You can try to change you command run at $SPARK_HOME or, point to
> README.md with full path.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>
> On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com)
> wrote:
>
> It looks spark is not working fine :
>
> I followed this link (
> http://spark.apache.org/docs/latest/ec2-scripts.html. ) and I see spot
> instances installed on EC2.
>
> from spark shell I am counting lines and got connection exception.
> *scala> val lines = sc.textFile("README.md")*
> *scala> lines.count()*
>
>
>
> *scala> val lines = sc.textFile("README.md")*
>
> 16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as
> values in memory (estimated size 26.5 KB, free 26.5 KB)
> 16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0
> stored as bytes in memory (estimated size 5.6 KB, free 32.1 KB)
> 16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on 172.31.28.196:44028 (size: 5.6 KB, free: 511.5 MB)
> 16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from
> textFile at :21
> lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile
> at :21
>
> *scala> lines.count()*
>
> 16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 3 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server:
> ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000. Already
> tried 4 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server:
> 

Re: rdd.foreach return value

2016-01-18 Thread charles li
thanks, david and ted, I know that the content of RDD can be returned to
driver using `collect


​

On Tue, Jan 19, 2016 at 11:44 AM, Ted Yu  wrote:

> Here is signature for foreach:
>  def foreach(f: T => Unit): Unit = withScope {
>
> I don't think you can return element in the way shown in the snippet.
>
> On Mon, Jan 18, 2016 at 7:34 PM, charles li 
> wrote:
>
>> code snippet
>>
>>
>> ​
>> the 'print' actually print info on the worker node, but I feel confused
>> where the 'return' value
>> goes to. for I get nothing on the driver node.
>> --
>> *--*
>> a spark lover, a quant, a developer and a good man.
>>
>> http://github.com/litaotao
>>
>
>


-- 
*--*
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao


Re: SparkR with Hive integration

2016-01-18 Thread Jeff Zhang
Please make sure you export environment variable HADOOP_CONF_DIR which
contains the core-site.xml

On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang  wrote:

> Hi all,
>
> http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes
> From Hive tables
> 
>
> You can also create SparkR DataFrames from Hive tables. To do this we will
> need to create a HiveContext which can access tables in the Hive MetaStore.
> Note that Spark should have been built with Hive support
> 
>  and
> more details on the difference between SQLContext and HiveContext can be
> found in the SQL programming guide
> 
> .
>
> # sc is an existing SparkContext.
> hiveContext <- sparkRHive.init(sc)
>
> sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sql(hiveContext, "LOAD DATA LOCAL INPATH 
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
> # Queries can be expressed in HiveQL.
> results <- sql(hiveContext, "FROM src SELECT key, value")
> # results is now a DataFrame
> head(results)
> ##  key   value
> ## 1 238 val_238
> ## 2  86  val_86
> ## 3 311 val_311
>
>
> I use RStudio to run above command, when I run "sql(hiveContext, "CREATE
> TABLE IF NOT EXISTS src (key INT, value STRING)”)”
>
> I got exception: 16/01/19 12:11:51 INFO FileUtils: Creating directory if
> it doesn't exist: file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR
> DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException:
> MetaException(message:file:/user/hive/warehouse/src is not a directory or
> unable to create one)
>
> How  to use HDFS instead of local file system(file)?
> Which parameter should to set?
>
> Thanks a lot.
>
>
> Peter Zhang
> --
> Google
> Sent with Airmail
>



-- 
Best Regards

Jeff Zhang


Re: Spark 1.6.0, yarn-shuffle

2016-01-18 Thread johd
Hi,

No, i have not. :-/

Regards, J



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-0-yarn-shuffle-tp25961p26002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: rdd.foreach return value

2016-01-18 Thread Darren Govoni


What's the rationale behind that? It certainly limits the kind of flow logic we 
can do in one statement.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: David Russell  
Date: 01/18/2016  10:44 PM  (GMT-05:00) 
To: charles li  
Cc: user@spark.apache.org 
Subject: Re: rdd.foreach return value 

The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.

David

"All that is gold does not glitter, Not all those who wander are lost."


 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org

code snippet



the 'print' actually print info on the worker node, but I feel confused where 
the 'return' value 
goes to. for I get nothing on the driver node.
-- 
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao



Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I am adding an answer from SO:
http://stackoverflow.com/questions/34861947/read-more-kafka-topics-than-number-of-cpu-cores

  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997p26001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread vivek.meghanathan
Have you verified that the Spark master/slaves are started correctly? Please check
the open ports using the netstat command. Are they listening? Which address are they
bound to, etc.?

From: Oleg Ruchovets [mailto:oruchov...@gmail.com]
Sent: 19 January 2016 11:24
To: Peter Zhang 
Cc: Daniel Darabos ; user 

Subject: Re: spark 1.6.0 on ec2 doesn't work

I am running from  $SPARK_HOME.
It looks like connection  problem to port 9000. It is on master machine.
What is this process is spark tries to connect?
Should I start any framework , processes before executing spark?

Thanks
OIeg.


16/01/19 03:17:56 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:57 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:58 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 3 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:17:59 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 4 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:00 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 5 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:01 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 6 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:02 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 7 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:03 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
16/01/19 03:18:04 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 9 time(s); retry

On Tue, Jan 19, 2016 at 1:13 PM, Peter Zhang 
> wrote:
Could you run spark-shell at $SPARK_HOME DIR?

You can try to change you command run at $SPARK_HOME or, point to README.md 
with full path.


Peter Zhang
--
Google
Sent with Airmail


On January 19, 2016 at 11:26:14, Oleg Ruchovets 
(oruchov...@gmail.com) wrote:
It looks spark is not working fine :

I followed this link ( http://spark.apache.org/docs/latest/ec2-scripts.html. ) 
and I see spot instances installed on EC2.

from spark shell I am counting lines and got connection exception.
scala> val lines = sc.textFile("README.md")
scala> lines.count()



scala> val lines = sc.textFile("README.md")

16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0 stored as values 
in memory (estimated size 26.5 KB, free 26.5 KB)
16/01/19 03:17:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 5.6 KB, free 32.1 KB)
16/01/19 03:17:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on 172.31.28.196:44028 (size: 5.6 KB, free: 
511.5 MB)
16/01/19 03:17:35 INFO spark.SparkContext: Created broadcast 0 from textFile at 
:21
lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at 
:21

scala> lines.count()

16/01/19 03:17:55 INFO ipc.Client: Retrying connect to server: 
ec2-54-88-242-197.compute-1.amazonaws.com/172.31.28.196:9000.
 Already tried 0 time(s); retry policy is 

when enable kerberos in hdp, the spark does not work

2016-01-18 Thread 李振
When Kerberos is enabled in HDP, Spark does not work. The error is as follows:
Traceback (most recent call last):
  File "/home/lizhen/test.py", line 27, in 
abc = raw_data.count()
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1006, 
in count
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 997, 
in sum
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 871, 
in fold
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773, 
in collect
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/hdp/2.3.4.0-3485/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: java.net.ConnectException: Connection refused
at 
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:888)
at 
org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2243)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
at 
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 

a problem about using UDF at sparksql

2016-01-18 Thread ??????
Hi everyone, can anyone take a look at this problem?


I am migrating from Hive to Spark SQL, but I have encountered a problem.


The temporary UDF function, which extends GenericUDF, can be created, and you can
describe it, but when I use the function, an exception occurs.
I don't know where the problem is.
spark version: apache 1.5.1 /standalone mode/CLI

spark-sql> describe function devicemodel;
Function: deviceModel
Class: td.enterprise.hive.udfs.ConvertDeviceModel
Usage: deviceModel(expr) - return other information
Time taken: 0.041 seconds, Fetched 3 row(s)

spark-sql> 
> select devicemodel(p.mobile_id, 'type', 'MOBILE_TYPE') as mobile_type from 
> analytics_device_profile_zh3 p limit 1;
16/01/19 15:19:46 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 15)
com.esotericsoftware.kryo.KryoException: Buffer underflow.
Serialization trace:
mobileAttributeCahceMap (td.enterprise.hive.udfs.ConvertDeviceModel)
at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
at com.esotericsoftware.kryo.io.Input.readAscii_slow(Input.java:580)
at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:558)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:436)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:134)
at 
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.deserializeObjectByKryo(HiveShim.scala:134)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.deserializePlan(HiveShim.scala:150)
at 
org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.readExternal(HiveShim.scala:191)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 

using spark context in map funciton TASk not serilizable error

2016-01-18 Thread gpatcham
Hi,

I have a use case where I need to pass sparkcontext in map function 

reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)

Method1 needs spark context to query cassandra. But I see below error

java.io.NotSerializableException: org.apache.spark.SparkContext

Is there a way we can fix this ?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
Can you pass the properties which are needed for accessing Cassandra
without going through SparkContext ?

SparkContext isn't designed to be used in the way illustrated below.

Cheers

On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:

> Hi,
>
> I have a use case where I need to pass sparkcontext in map function
>
> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>
> Method1 needs spark context to query cassandra. But I see below error
>
> java.io.NotSerializableException: org.apache.spark.SparkContext
>
> Is there a way we can fix this ?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Jacek Laskowski
Hi Ryan,

Ah, you might be right! I didn't think about the batches queued up
(and hence without scheduling delay since they're not started yet).
Thanks a lot for responding!

Pozdrawiam,
Jacek

Jacek Laskowski | https://medium.com/@jaceklaskowski/
Mastering Apache Spark
==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jan 18, 2016 at 8:03 PM, Shixiong(Ryan) Zhu
 wrote:
> Hey, did you mean that the scheduling delay timeline is incorrect because
> it's too short and some values are missing? A batch won't have a scheduling
> delay until it starts to run. In your example, a lot of batches are waiting
> so that they don't have the scheduling delay.
>
> On Sun, Jan 17, 2016 at 4:49 AM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> I'm trying to understand how Scheduling Delays are displayed in
>> Streaming page in web UI and think the values are displayed
>> incorrectly in the Timelines column. I'm only concerned with the
>> scheduling delays (on y axis) per batch times (x axis). It appears
>> that the values (on y axis) are correct, but not how they are
>> displayed per batch times.
>>
>> See the second screenshot in
>>
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-streaming-webui.html#scheduling-delay.
>>
>> Can anyone explain how the delays for batches per batch time should be
>> read? I'm specifically asking about the timeline (not histogram as it
>> seems fine).
>>
>> Pozdrawiam,
>> Jacek
>>
>> Jacek Laskowski | https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark
>> ==> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL create table

2016-01-18 Thread Raghu Ganti
Great, I got that to work following your example! Thanks.

A followup question is: If I had a custom SQL type (UserDefinedType),
how can I map it to this type from the RDD in the DataFrame?

Regards

On Mon, Jan 18, 2016 at 1:35 PM, Ted Yu  wrote:

> By SparkSQLContext, I assume you mean SQLContext.
> From the doc for SQLContext#createDataFrame():
>
>*  dataFrame.registerTempTable("people")
>*  sqlContext.sql("select name from people").collect.foreach(println)
>
> If you want to persist table externally, you need Hive, etc
>
> Regards
>
> On Mon, Jan 18, 2016 at 10:28 AM, Raghu Ganti 
> wrote:
>
>> This requires Hive to be installed and uses HiveContext, right?
>>
>> What is the SparkSQLContext useful for?
>>
>> On Mon, Jan 18, 2016 at 1:27 PM, Ted Yu  wrote:
>>
>>> Please take a look
>>> at 
>>> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDataFrameSuite.scala
>>>
>>> On Mon, Jan 18, 2016 at 9:57 AM, raghukiran 
>>> wrote:
>>>
 Is creating a table using the SparkSQLContext currently supported?

 Regards,
 Raghu



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-create-table-tp25996.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


is recommendProductsForUsers available in ALS?

2016-01-18 Thread Roberto Pagliari
With Spark 1.5, the following code:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.recommendation import ALS, Rating
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
model = ALS.trainImplicit(ratings, 1, seed=10)

res = model.recommendProductsForUsers(2)

raises the error

---
AttributeErrorTraceback (most recent call last)
 in ()
  7 model = ALS.trainImplicit(ratings, 1, seed=10)
  8
> 9 res = model.recommendProductsForUsers(2)

AttributeError: 'MatrixFactorizationModel' object has no attribute 
'recommendProductsForUsers'

If the method is not available, is there a workaround with a large number of 
users and products?
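
If staying on 1.5, one workaround is to train and query the model through the Scala API,
where recommendProductsForUsers is available and returns an RDD, so it scales to many
users and products. A minimal Scala sketch using the same toy ratings (the rank and
iteration values are illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 1.0),
  Rating(1, 2, 2.0),
  Rating(2, 1, 2.0)))

val model = ALS.trainImplicit(ratings, 1, 10)   // rank = 1, iterations = 10

// top 2 products per user, as an RDD[(userId, Array[Rating])]
val recs = model.recommendProductsForUsers(2)
recs.collect().foreach { case (user, rs) => println(user + " -> " + rs.mkString(", ")) }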


Re: spark streaming context trigger invoke stop why?

2016-01-18 Thread Shixiong(Ryan) Zhu
I see. There is a bug in 1.4.1 where a thread pool does not set the daemon
flag for its threads (
https://github.com/apache/spark/commit/346209097e88fe79015359e40b49c32cc0bdc439#diff-25124e4f06a1da237bf486eceb1f7967L47
)

So in 1.4.1, even if your main thread exits, the threads in that thread pool are
still running and the shutdown hook for StreamingContext cannot be called.

Actually, it's usually dangerous to ignore exceptions. If you really want
to, just use a `while(true)` loop to replace `awaitTermination`.
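
A rough sketch of that (with the caveat above that swallowing errors like this is usually
a bad idea):

ssc.start()

// instead of ssc.awaitTermination(): keep the driver's main thread alive ourselves,
// so an error reported to the ContextWaiter does not stop the application
while (true) {
  Thread.sleep(10000L)
}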


On Sat, Jan 16, 2016 at 12:02 AM, Triones,Deng(vip.com) <
triones.d...@vipshop.com> wrote:

> Thanks for your response.
>
>
>
> As I noticed, when my Spark version is 1.4.1, that kind of error
> won’t cause the driver to stop, whereas Spark 1.5.2 will cause the driver to stop, so I
> think there must be some change. As I see in the code @ Spark 1.5.2:
>
>
>
> JobScheduler.scala  :  jobScheduler.reportError("Error generating jobs for
> time " + time, e)  or  jobScheduler.reportError("Error in job generator",
> e)
>
> --> ContextWaiter.scala : notifyError()
>
>    --> ContextWaiter.scala : waitForStopOrError(), then the driver stops.
>
>
>
> According the driver log I have not seen message like “Error generating
> jobs for time” or “Error in job generator”
>
>
>
>
>
> Driver log as below :
>
>
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 410 in stage 215.0 failed 4 times, most recent
> failure: Lost task 410.3 in stage 215.0 (TID 178094, 10.201
>
> .114.142): java.lang.Exception: Could not compute split, block
> input-22-1452641669000 not found
>
> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:902)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:900)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>
> at
> 

Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Jerry Lam
Thanks Alex:

So you suggested something like:
from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'),
'America/Los_Angeles')?

This is a lot of conversion :)

Is there a particular reason not to have from_unixtime take timezone
information?

I think I will make a UDF if this is the only way out of the box.
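
Something like this minimal sketch, perhaps (the name from_unixtime_tz and the output
pattern are just for illustration):

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// from_unixtime_tz(seconds, zone): format a unix timestamp in an arbitrary time zone
sqlContext.udf.register("from_unixtime_tz", (seconds: Long, tz: String) => {
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(tz))
  fmt.format(new Date(seconds * 1000L))
})

// e.g. SELECT from_unixtime_tz(1389802875, 'America/Los_Angeles') FROM ...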

Thanks!

Jerry

On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov 
wrote:

> Look at
> to_utc_timestamp
>
> from_utc_timestamp
> On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:
>
>> Hi spark users and developers,
>>
>> what do you do if you want the from_unixtime function in spark sql to
>> return the timezone you want instead of the system timezone?
>>
>> Best Regards,
>>
>> Jerry
>>
>


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
If you can find a function in Oracle, MySQL or Postgres which works
better, then we can create a similar one.

Timezone conversion is tricky because of daylight saving time,
so it is better to use UTC without DST in the database/DW.
On Jan 18, 2016 1:24 PM, "Jerry Lam"  wrote:

> Thanks Alex:
>
> So you suggested something like:
> from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'),
> 'America/Los_Angeles')?
>
> This is a lot of conversion :)
>
> Is there a particular reason not to have from_unixtime to take timezone
> information?
>
> I think I will make a UDF if this is the only way out of the box.
>
> Thanks!
>
> Jerry
>
> On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov  > wrote:
>
>> Look at
>> to_utc_timestamp
>>
>> from_utc_timestamp
>> On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:
>>
>>> Hi spark users and developers,
>>>
>>> what do you do if you want the from_unixtime function in spark sql to
>>> return the timezone you want instead of the system timezone?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
I'm using spark cassandra connector to do this and the way we access
cassandra table is

sc.cassandraTable("keySpace", "tableName")

Thanks
Giri
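
One pattern that avoids putting the SparkContext into the closure is to capture the
connector's CassandraConnector (which is serializable) and open a session per partition.
A rough sketch only; keyspace, table, column names and the assumption that the key is
text are all illustrative, and reRDD/outputDir are the names from the original snippet:

import com.datastax.spark.connector.cql.CassandraConnector

val connector = CassandraConnector(sc.getConf)   // serializable, unlike SparkContext

val enriched = reRDD.mapPartitions { rows =>
  connector.withSessionDo { session =>
    rows.map { row =>
      val rs = session.execute(
        "SELECT value FROM my_keyspace.my_table WHERE key = ?", row.toString)
      val value = Option(rs.one()).map(_.getString("value")).getOrElse("")
      row.toString + "," + value
    }.toList.iterator   // materialize before the session is closed
  }
}
enriched.saveAsTextFile(outputDir)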

On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:

> Can you pass the properties which are needed for accessing Cassandra
> without going through SparkContext ?
>
> SparkContext isn't designed to be used in the way illustrated below.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>
>> Hi,
>>
>> I have a use case where I need to pass sparkcontext in map function
>>
>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>
>> Method1 needs spark context to query cassandra. But I see below error
>>
>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>
>> Is there a way we can fix this ?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
Can we use @transient ?


On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:

> I'm using spark cassandra connector to do this and the way we access
> cassandra table is
>
> sc.cassandraTable("keySpace", "tableName")
>
> Thanks
> Giri
>
> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>
>> Can you pass the properties which are needed for accessing Cassandra
>> without going through SparkContext ?
>>
>> SparkContext isn't designed to be used in the way illustrated below.
>>
>> Cheers
>>
>> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>>
>>> Hi,
>>>
>>> I have a use case where I need to pass sparkcontext in map function
>>>
>>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>>
>>> Method1 needs spark context to query cassandra. But I see below error
>>>
>>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>>
>>> Is there a way we can fix this ?
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Ted Yu
Please see:
http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/

On Mon, Jan 18, 2016 at 1:22 PM, Mich Talebzadeh 
wrote:

> ./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6
> -Phive -Phive-thriftserver -Pyarn
>
>
>
>
>
> [INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.10 ---
>
> *../R/install-dev.sh: line 40: Rscript: command not found*
>
> [INFO]
> 
>
> [INFO] Reactor Summary:
>
> [INFO]
>
> [INFO] Spark Project Parent POM ... SUCCESS [  2.921 s]
>
> [INFO] Spark Project Test Tags  SUCCESS [  2.921 s]
>
> [INFO] Spark Project Launcher . SUCCESS [ 17.252 s]
>
> [INFO] Spark Project Networking ... SUCCESS [  9.237 s]
>
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  4.969 s]
>
> [INFO] Spark Project Unsafe ... SUCCESS [ 13.384 s]
>
> [INFO] Spark Project Core . FAILURE [01:34 min]
>
>
>
>
>
> How can I resolve this by any chance?
>
>
>
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>


Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread radoburansky
I somehow don't want to believe in this waste of resources. Is it really true
that if I have 20 input streams I must have at least 21 CPU cores, even if I
read only once per minute and only a few messages? I still hope that I'm
missing some important information. Thanks a lot



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



trouble using eclipse to view spark source code

2016-01-18 Thread Andy Davidson
Hi 

My project is implemented using Java 8 and Python. Sometimes it's handy to
look at the Spark source code. For some unknown reason, if I open a Spark
project my Java projects show tons of compiler errors. I think it may have
something to do with Scala. If I close the Spark projects, my Java code is
fine.

I typically only want to import the machine learning and streaming
projects.

I am not sure whether this is relevant, but my Java projects are built
using Gradle.

In Eclipse preferences -> Scala -> Installations I selected Scala: 2.10.6
(built in).

Any suggestions would be greatly appreciated.

Andy








-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
Did you mean constructing SparkContext on the worker nodes ?

Not sure whether that would work.

Doesn't seem to be good practice.

On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:

> Can we use @transient ?
>
>
> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>
>> I'm using spark cassandra connector to do this and the way we access
>> cassandra table is
>>
>> sc.cassandraTable("keySpace", "tableName")
>>
>> Thanks
>> Giri
>>
>> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>>
>>> Can you pass the properties which are needed for accessing Cassandra
>>> without going through SparkContext ?
>>>
>>> SparkContext isn't designed to be used in the way illustrated below.
>>>
>>> Cheers
>>>
>>> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>>>
 Hi,

 I have a use case where I need to pass sparkcontext in map function

 reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)

 Method1 needs spark context to query cassandra. But I see below error

 java.io.NotSerializableException: org.apache.spark.SparkContext

 Is there a way we can fix this ?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>
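If the per-row work is really a join on the table's primary key, another
option, assuming a connector version that provides joinWithCassandraTable and
a single-column partition key (the names below are made up), is to let the
connector do the join instead of querying inside map:

import com.datastax.spark.connector._

// each element of reRDD is assumed to be the partition-key value to join on
val joined = reRDD.map(key => Tuple1(key))
  .joinWithCassandraTable("keySpace", "tableName") // yields (key, CassandraRow) pairs

joined.map { case (Tuple1(key), row) => s"$key,${row.getString("value")}" }
  .saveAsTextFile(outputDir)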


Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Giri P
Yes, I tried doing that, but it doesn't work.

I'm looking at using SQLContext and dataframes. Is SQLContext serializable?

On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu  wrote:

> Did you mean constructing SparkContext on the worker nodes ?
>
> Not sure whether that would work.
>
> Doesn't seem to be good practice.
>
> On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:
>
>> Can we use @transient ?
>>
>>
>> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>>
>>> I'm using spark cassandra connector to do this and the way we access
>>> cassandra table is
>>>
>>> sc.cassandraTable("keySpace", "tableName")
>>>
>>> Thanks
>>> Giri
>>>
>>> On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:
>>>
 Can you pass the properties which are needed for accessing Cassandra
 without going through SparkContext ?

 SparkContext isn't designed to be used in the way illustrated below.

 Cheers

 On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:

> Hi,
>
> I have a use case where I need to pass sparkcontext in map function
>
> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>
> Method1 needs spark context to query cassandra. But I see below error
>
> java.io.NotSerializableException: org.apache.spark.SparkContext
>
> Is there a way we can fix this ?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: Number of CPU cores for a Spark Streaming app in Standalone mode

2016-01-18 Thread Tathagata Das
If you are using receiver-based input streams, then you have to dedicate 1
core to each receiver. If you read only once per minute on each receiver,
then consider consolidating the data reading pipeline so that you can use
fewer receivers.

On Mon, Jan 18, 2016 at 12:13 PM, radoburansky 
wrote:

> I somehow don't want to believe in this waste of resources. Is it really
> true that if I have 20 input streams I must have at least 21 CPU cores,
> even if I read only once per minute and only a few messages? I still hope
> that I'm missing some important information. Thanks a lot
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Number-of-CPU-cores-for-a-Spark-Streaming-app-in-Standalone-mode-tp25997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
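A minimal sketch of that consolidation (the host names, port and 60-second
batch interval are made up): a handful of receivers unioned into one DStream,
so only a few cores stay pinned to receiving and the rest are free for
processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("consolidated-receivers")
val ssc = new StreamingContext(conf, Seconds(60))

// e.g. 4 receivers instead of 20 -> only 4 cores dedicated to receiving
val receiverStreams = (1 to 4).map(i => ssc.socketTextStream("source-" + i, 9999))
val unified = ssc.union(receiverStreams)

unified.foreachRDD(rdd => println("received " + rdd.count() + " records"))

ssc.start()
ssc.awaitTermination()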


Re: has any one implemented TF_IDF using ML transformers?

2016-01-18 Thread Andy Davidson
Hi Yanbo

I am using 1.6.0. I am having a hard time trying to figure out what the
exact equation is; I do not know Scala.

I took a look at the source code URL you provided:

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    val idf = udf { vec: Vector => idfModel.transform(vec) }
    dataset.withColumn($(outputCol), idf(col($(inputCol))))
  }


You mentioned the doc is out of date:
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

Based on my understanding of the subject matter, the equations in that doc
are correct, but I could not find anything like them in the source code:

IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1)),

TFIDF(t,d,D) = TF(t,d) * IDF(t,D).


I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite,
but its results do not match the equation. (In general the unit test asserts
seem incomplete.)

I have created several small test examples to try to figure out how to use
NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta, probabilities,
… that Spark produces do not match the published results at
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html


Kind regards

Andy 

private DataFrame createTrainingData() {
    // make sure we only use dictionarySize words
    JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
            // label 0.0 is Chinese, label 1.0 is notChinese
            RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
            RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),
            RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
            RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan", "Chinese"))));

    return createData(rdd);
}

private DataFrame createData(JavaRDD<Row> rdd) {
    StructField id = new StructField("id", DataTypes.IntegerType, false, Metadata.empty());
    StructField label = new StructField("label", DataTypes.DoubleType, false, Metadata.empty());
    StructField words = new StructField("words",
            DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty());

    StructType schema = new StructType(new StructField[] { id, label, words });
    return sqlContext.createDataFrame(rdd, schema);
}

private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
    HashingTF hashingTF = new HashingTF()
            .setInputCol("words")
            .setOutputCol("tf")
            .setNumFeatures(dictionarySize);

    DataFrame termFrequenceDF = hashingTF.transform(rawDF);
    termFrequenceDF.cache(); // idf needs to make 2 passes over the data set

    // val idf = new IDF(minDocFreq = 2).fit(tf)
    IDFModel idf = new IDF()
            // .setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
            .setInputCol(hashingTF.getOutputCol())
            .setOutputCol("idf")
            .fit(termFrequenceDF);

    return idf.transform(termFrequenceDF);
}



 |-- id: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- words: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- tf: vector (nullable = true)
 |-- idf: vector (nullable = true)



+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
|id |label|words                       |tf                       |idf                                                    |
+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
|0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
|1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
|2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
|3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
+---+-----+----------------------------+-------------------------+-------------------------------------------------------+
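For what it's worth, the idf column above does line up with the formula in the
MLlib docs, IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1)) with the natural log. A
quick check in plain Scala (nothing Spark-specific):

// |D| = 4 training documents in the example above
val numDocs = 4
def idf(docFreq: Int): Double = math.log((numDocs + 1.0) / (docFreq + 1.0))

idf(4) // "Chinese" appears in all 4 docs -> 0.0
idf(1) // "Beijing", "Shanghai", "Macao", "Tokyo", "Japan" -> 0.9162907318741551

So at least for this corpus the ml.feature.IDF output matches the documented
equation.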




Here is the spark test case



  @Test
  public void tfIdf() {
    // The tests are to check Java compatibility.
    HashingTF tf = new HashingTF();
    @SuppressWarnings("unchecked")
    JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
      Arrays.asList("this is a sentence".split(" ")),
      Arrays.asList("this is another sentence".split(" ")),
  

building spark 1.6 throws error Rscript: command not found

2016-01-18 Thread Mich Talebzadeh
./make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.6
-Phive -Phive-thriftserver -Pyarn

 

 

[INFO] --- exec-maven-plugin:1.4.0:exec (sparkr-pkg) @ spark-core_2.10 ---

../R/install-dev.sh: line 40: Rscript: command not found

[INFO]


[INFO] Reactor Summary:

[INFO]

[INFO] Spark Project Parent POM ... SUCCESS [  2.921 s]

[INFO] Spark Project Test Tags  SUCCESS [  2.921 s]

[INFO] Spark Project Launcher . SUCCESS [ 17.252 s]

[INFO] Spark Project Networking ... SUCCESS [  9.237 s]

[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  4.969 s]

[INFO] Spark Project Unsafe ... SUCCESS [ 13.384 s]

[INFO] Spark Project Core . FAILURE [01:34 min]

 

 

How can I resolve this by any chance?

 

 

Thanks

 

 

Dr Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

  http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 



Re: using spark context in map funciton TASk not serilizable error

2016-01-18 Thread Ted Yu
class SQLContext private[sql](
    @transient val sparkContext: SparkContext,
    @transient protected[sql] val cacheManager: CacheManager,
    @transient private[sql] val listener: SQLListener,
    val isRootContext: Boolean)
  extends org.apache.spark.Logging with Serializable {

FYI

On Mon, Jan 18, 2016 at 1:44 PM, Giri P  wrote:

> yes I tried doing that but that doesn't work.
>
> I'm looking at using SQLContext and dataframes. Is SQLCOntext serializable?
>
> On Mon, Jan 18, 2016 at 1:29 PM, Ted Yu  wrote:
>
>> Did you mean constructing SparkContext on the worker nodes ?
>>
>> Not sure whether that would work.
>>
>> Doesn't seem to be good practice.
>>
>> On Mon, Jan 18, 2016 at 1:27 PM, Giri P  wrote:
>>
>>> Can we use @transient ?
>>>
>>>
>>> On Mon, Jan 18, 2016 at 12:44 PM, Giri P  wrote:
>>>
 I'm using spark cassandra connector to do this and the way we access
 cassandra table is

 sc.cassandraTable("keySpace", "tableName")

 Thanks
 Giri

 On Mon, Jan 18, 2016 at 12:37 PM, Ted Yu  wrote:

> Can you pass the properties which are needed for accessing Cassandra
> without going through SparkContext ?
>
> SparkContext isn't designed to be used in the way illustrated below.
>
> Cheers
>
> On Mon, Jan 18, 2016 at 12:29 PM, gpatcham  wrote:
>
>> Hi,
>>
>> I have a use case where I need to pass sparkcontext in map function
>>
>> reRDD.map(row =>method1(row,sc)).saveAsTextFile(outputDir)
>>
>> Method1 needs spark context to query cassandra. But I see below error
>>
>> java.io.NotSerializableException: org.apache.spark.SparkContext
>>
>> Is there a way we can fix this ?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/using-spark-context-in-map-funciton-TASk-not-serilizable-error-tp25998.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>
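A minimal sketch of what that @transient marker buys you (the class and field
names below are made up): the wrapper is Serializable, closures that touch its
ordinary fields can be shipped, and the SparkContext field is simply skipped
during serialization instead of raising NotSerializableException. The
@transient field is null on the executors, so it must only be used on the
driver:

import org.apache.spark.SparkContext

class JobHelper(@transient val sc: SparkContext, suffix: String) extends Serializable {
  def tag(input: Seq[String]): Array[String] = {
    // the closure references `suffix`, so the JobHelper instance is serialized
    // with the task; `sc` is @transient and therefore left out
    sc.parallelize(input).map(word => word + suffix).collect()
  }
}

// driver side:
// val helper = new JobHelper(sc, "-tagged")
// helper.tag(Seq("a", "b")).foreach(println)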