Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Although users can use the hdfs glob syntax to support multiple inputs,
sometimes it is not convenient to do that. Not sure why there's no
SparkContext#textFiles api. It should be easy to implement, and I'd love to
create a ticket and contribute it if there's no other consideration
that I'm not aware of.
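
To make it concrete, the glob workaround looks like the sketch below (paths
are placeholders), and the second half is only the kind of API I have in
mind, not something that exists today:

// Works today: HDFS glob syntax, but the pattern gets awkward when the
// inputs are an arbitrary list of files rather than a regular layout.
val viaGlob = sc.textFile("/data/logs/2015/11/{01,02,03}/*.log")

// Hypothetical API I'd like to propose (does not exist yet):
//   def textFiles(paths: Seq[String]): RDD[String]
// val rdd = sc.textFiles(Seq("/data/a.txt", "/data/b.txt", "/data/c.txt"))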

-- 
Best Regards

Jeff Zhang


Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Nirmal Fernando
As of now, we basically serialize the ML model and then deserialize it for
prediction at real time.
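
Roughly, that flow looks like the sketch below, assuming an MLlib
LogisticRegressionModel; the path and feature values are just placeholders:

import java.io._
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Training side: write the fitted model out with plain JVM serialization.
// `model` is a LogisticRegressionModel trained earlier in a Spark job.
val out = new ObjectOutputStream(new FileOutputStream("/tmp/lr-model.bin"))
out.writeObject(model)
out.close()

// Serving side: read the model back (no SparkContext needed) and score a
// single feature vector per request.
val in = new ObjectInputStream(new FileInputStream("/tmp/lr-model.bin"))
val served = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()
val score = served.predict(Vectors.dense(0.5, 1.2, -0.3))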

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase  wrote:

> I don’t think this answers your question but here’s how you would evaluate
> the model in realtime in a streaming app
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>
> Maybe you can find a way to extract portions of MLLib and run them outside
> of spark – loading the precomputed model and calling .predict on it…
>
> -adrian
>
> From: Andy Davidson
> Date: Tuesday, November 10, 2015 at 11:31 PM
> To: "user @spark"
> Subject: thought experiment: use spark ML to real time prediction
>
> Let's say I have used spark ML to train a linear model. I know I can save
> and load the model to disk. I am not sure how I can use the model in a real
> time environment. For example I do not think I can return a “prediction” to
> the client using spark streaming easily. Also for some applications the
> extra latency created by the batch process might not be acceptable.
>
> If I was not using spark I would re-implement the model I trained in my
> batch environment in a lang like Java  and implement a rest service that
> uses the model to create a prediction and return the prediction to the
> client. Many models make predictions using linear algebra. Implementing
> predictions is relatively easy if you have a good vectorized LA package. Is
> there a way to use a model I trained using spark ML outside of spark?
>
> As a motivating example, even if it's possible to return data to the client
> using spark streaming. I think the mini batch latency would not be
> acceptable for a high frequency stock trading system.
>
> Kind regards
>
> Andy
>
> P.s. The examples I have seen so far use spark streaming to “preprocess”
> predictions. For example a recommender system might use what current users
> are watching to calculate “trending recommendations”. These are stored on
> disk and served up to users when they use the “movie guide”. If a
> recommendation was a couple of min. old it would not affect the end users
> experience.
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: Spark Streaming Checkpoint help failed application

2015-11-11 Thread Gideon
Hi,

I'm no expert but

Short answer: yes, after restarting, your application will reread the failed
messages.

Longer answer: it seems like you're mixing several things together
Let me try and explain:
- WAL is used to prevent your application from losing data by making the
executor first write the data it receives from Kafka into WAL and only then
updating the Kafka high level consumer (what the receivers approach is
using) that it actually received the data (making it at-least-once)
- Checkpoints are a mechanism that helps your *driver* recover from failures
by saving driver information into HDFS (or S3 or whatever)

Now, the reason I explained these is this: you asked "... one bug caused the
streaming application to fail and exit" - so the failure you're trying to
solve is in the driver. When you restart your application your driver will
go and fetch the information it last saved in the checkpoint (saved into
HDFS) and tell the new executors (since the previous driver died, so did the
executors) to continue consuming data. Since your executors are using the
receivers approach (as opposed to the directkafkastream) with WAL what will
happen is that when they (the executors) get started they will first execute
what was saved in the WAL and then read from the latest offsets saved in
Kafka (Zookeeper) which in your case means you won't lose data (the
executors first save the data to WAL then advance their offsets on Kafka)

If you decide to go for the direct approach, then your driver will be the one
(and only one) managing the offsets for Kafka, which means that some of the
data the driver will save in the checkpoint will be the Kafka offsets.
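
(For completeness: driver recovery from the checkpoint only works if the
application is written with StreamingContext.getOrCreate, roughly as in the
sketch below; the checkpoint path and batch interval are just placeholders.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/me/checkpoints/my-app"

// Everything (inputs, transformations, checkpoint dir) must be set up inside
// this function so it can be rebuilt from the checkpoint after a crash.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... create the Kafka input stream and the rest of the pipeline here ...
  ssc
}

// First run: calls createContext(). After a driver failure: rebuilds the
// context (and its pending batches) from the data saved in the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()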

I hope this helps :)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpoint-help-failed-application-tp25347p25357.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread Adrian Tanase
I don’t think this answers your question but here’s how you would evaluate the 
model in realtime in a streaming app
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html

Maybe you can find a way to extract portions of MLLib and run them outside of 
spark – loading the precomputed model and calling .predict on it…

-adrian

From: Andy Davidson
Date: Tuesday, November 10, 2015 at 11:31 PM
To: "user @spark"
Subject: thought experiment: use spark ML to real time prediction

Let's say I have used spark ML to train a linear model. I know I can save and 
load the model to disk. I am not sure how I can use the model in a real time 
environment. For example I do not think I can return a “prediction” to the 
client using spark streaming easily. Also for some applications the extra 
latency created by the batch process might not be acceptable.

If I was not using spark I would re-implement the model I trained in my batch 
environment in a lang like Java  and implement a rest service that uses the 
model to create a prediction and return the prediction to the client. Many 
models make predictions using linear algebra. Implementing predictions is 
relatively easy if you have a good vectorized LA package. Is there a way to use 
a model I trained using spark ML outside of spark?

As a motivating example, even if it's possible to return data to the client 
using spark streaming. I think the mini batch latency would not be acceptable 
for a high frequency stock trading system.

Kind regards

Andy

P.s. The examples I have seen so far use spark streaming to “preprocess” 
predictions. For example a recommender system might use what current users are 
watching to calculate “trending recommendations”. These are stored on disk and 
served up to users when they use the “movie guide”. If a recommendation was a 
couple of min. old it would not affect the end users experience.



Re: Start python script with SparkLauncher

2015-11-11 Thread Andrejs

Thanks Ted,
that helped me, it turned out that I had wrongly formatted the name of the 
server; I had to add spark:// in front of the server name.

Cheers,
Andrejs
On 11/11/15 14:26, Ted Yu wrote:
Please take a look 
at launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java 
to see how app.getInputStream() and app.getErrorStream() are handled.


In master branch, the Suite is located 
at core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java


FYI

On Wed, Nov 11, 2015 at 5:57 AM, Andrejs 
> wrote:


Hi all,
I'm trying to call a python script from a scala application. Below
is part of my code.
My problem is that it doesn't work, but it also doesn't provide
any error message, so I can't debug it.

val spark = new SparkLauncher()
  .setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")
  .setAppResource("/home/user/MyCode/forSpark/wordcount.py")
  .addPyFile("/home/andabe/MyCode/forSpark/wordcount.py")
  .setMaster("myServerName")
  .setAppName("pytho2word")
  .launch()
println("finishing")
spark.waitFor()
println("finished")


Any help is appreciated.

Cheers,
Andrejs







Re: How to configure logging...

2015-11-11 Thread Andy Davidson
Hi Hitoshi

Looks like you have read
http://spark.apache.org/docs/latest/configuration.html#configuring-logging

On my ec2 cluster I need to also do the following. I think my notes are not
complete. I think you may also need to restart your cluster

Hope this helps

Andy


#
# setting up logger so logging goes to file, makes demo easier to understand
#
ssh -i $KEY_FILE root@$MASTER
cp /home/ec2-user/log4j.properties  /root/spark/conf/
for i in `cat /root/spark/conf/slaves`; do
  scp /home/ec2-user/log4j.properties root@$i:/home/ec2-user/log4j.properties
done


#
# restart spark
#
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh


From:  Hitoshi 
Date:  Tuesday, November 10, 2015 at 1:22 PM
To:  "user @spark" 
Subject:  Re: How to configure logging...

> I don't have akka but with just Spark, I just edited log4j.properties to
> "log4j.rootCategory=ERROR, console" and ran the following command and was
> able to get only the Time row as output.
> 
> run-example org.apache.spark.examples.streaming.JavaNetworkWordCount
> localhost 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-logging-t
> p25346p25348.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 




Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them
into a Seq and use "reduce(_ ++ _)".
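
For example, roughly (file names are placeholders and sc is an existing
SparkContext):

val paths = Seq("file1", "file2", "file3")
val lines = paths.map(sc.textFile(_)).reduce(_ ++ _)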

Best Regards,
Shixiong Zhu

2015-11-11 10:21 GMT-08:00 Jakob Odersky :

> Hey Jeff,
> Do you mean reading from multiple text files? In that case, as a
> workaround, you can use the RDD#union() (or ++) method to concatenate
> multiple rdds. For example:
>
> val lines1 = sc.textFile("file1")
> val lines2 = sc.textFile("file2")
>
> val rdd = lines1 union lines2
>
> regards,
> --Jakob
>
> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>
>> Although user can use the hdfs glob syntax to support multiple inputs.
>> But sometimes, it is not convenient to do that. Not sure why there's no api
>> of SparkContext#textFiles. It should be easy to implement that. I'd love to
>> create a ticket and contribute for that if there's no other consideration
>> that I don't know.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


Re: Spark on YARN using Java 1.8 fails

2015-11-11 Thread mvle
Unfortunately, no. I switched back to OpenJDK 1.7.
Didn't get a chance to dig deeper.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-using-Java-1-8-fails-tp24925p25360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple rdds. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob

On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although user can use the hdfs glob syntax to support multiple inputs. But
> sometimes, it is not convenient to do that. Not sure why there's no api
> of SparkContext#textFiles. It should be easy to implement that. I'd love to
> create a ticket and contribute for that if there's no other consideration
> that I don't know.
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Spark on YARN using Java 1.8 fails

2015-11-11 Thread Abel Rincón
Hi,

There was another related question

https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201506.mbox/%3CCAJ2peNeruM2Y2Tbf8-Wiras-weE586LM_o25FsN=+z1-bfw...@mail.gmail.com%3E


Some months ago, if I remember well, using spark 1.3 + YARN + Java 8 we had
the same problem.
https://issues.apache.org/jira/browse/SPARK-6388

BTW, nowadays we chose to use Java 7.


Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread mvle
Hi,

I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and am trying
to run the pyspark shell using Spark 1.5.1

pyspark shell works and I can run a sample code to calculate PI just fine.
However, when I try to stop the current context (e.g., sc.stop()) and then
create a new context (sc = SparkContext()), I get the error below.

I have also seen errors such as: "token (HDFS_DELEGATION_TOKEN token 42 for
hadoop) can't be found in cache",

Does anyone know if it is possible to stop and create a new Spark context
within a single JVM process (driver) and have that work when dealing with
delegation tokens from Secure YARN/HDFS?

Thanks.

15/11/11 10:19:53 INFO yarn.Client: Setting up container launch context for
our AM
15/11/11 10:19:53 INFO yarn.Client: Setting up the launch environment for
our AM container
15/11/11 10:19:53 INFO yarn.Client: Credentials file set to:
credentials-37915c3e-1e90-44b9-add1-521598cea846
15/11/11 10:19:53 INFO yarn.YarnSparkHadoopUtil: getting token for namenode:
hdfs://test6-allwkrbsec-001:9000/user/hadoop/.sparkStaging/application_1446695132208_0042
15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token
can be issued only with kerberos or web authentication
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1044)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
at
org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:530)
at
org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:508)
at
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2228)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:126)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:123)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:123)
at
org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:495)
at
org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:528)
at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at 

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
Dean,

Thanks for the reply. I'm searching (via spark mailing list archive and
google) and can't find the previous thread you mentioned. I've stumbled
upon a few but may not be the thread you're referring to. I'm very
interested in reading that discussion and any links/keywords would be
greatly appreciated.

I can see it's a non-trivial problem to solve for every use case in
streaming and thus not yet supported in general. However, I think (maybe
naively) it can be solved for specific use cases. If I use the available
features to create a fault tolerant design - i.e. failures/dead nodes can
occur on master nodes, driver node, or executor nodes without data loss and
"at-least-once" semantics is acceptable - then can't I safely scale down in
streaming by killing executors? If this is not true, then are we saying
that streaming is not fault tolerant?

I know it won't be as simple as setting a config like
spark.dynamicAllocation.enabled=true and magically we'll have elastic
streaming, but I'm interested if anyone else has attempted to solve this
for their specific use case with extra coding involved? Pitfalls? Thoughts?

thanks,
Duc




On Wed, Nov 11, 2015 at 8:36 AM, Dean Wampler  wrote:

> Dynamic allocation doesn't work yet with Spark Streaming in any cluster
> scenario. There was a previous thread on this topic which discusses the
> issues that need to be resolved.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen 
> wrote:
>
>> I'm trying to get Spark Streaming to scale up/down its number of
>> executors within Mesos based on workload. It's not scaling down. I'm using
>> Spark 1.5.1 reading from Kafka using the direct (receiver-less) approach.
>>
>> Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287
>> with the right configuration, I have a simple example working with the
>> spark-shell connected to a Mesos cluster. By working I mean the number of
>> executors scales up/down based on workload. However, the spark-shell is not
>> a streaming example.
>>
>> What is that status of dynamic resource allocation with Spark Streaming
>> on Mesos? Is it supported at all? Or supported but with some caveats to
>> ensure no data loss?
>>
>> thanks,
>> Duc
>>
>
>


Re: NullPointerException with joda time

2015-11-11 Thread Ted Yu
In case you need to adjust log4j properties, see the following thread:

http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse

Cheers

On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu  wrote:

> I took a look at
> https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
> Looks like the NPE came from line below:
>
> long instant = getChronology().days().add(getMillis(), days);
> Maybe catch the NPE and print out the value of currentDate to see if
> there is more clue ?
>
> Cheers
>
> On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
> wrote:
>
>> see below a more complete version of the code.
>> the firstDate (previously minDate) should not be null, I even added an extra 
>> "filter( _._2 != null)" before the flatMap and the error is still there.
>>
>> What I don't understand is why I have the error on dateSeq.last.plusDays and 
>> not on dateSeq.last.isBefore (in the condition).
>>
>> I also tried changing the allDates function to use a while loop but i got 
>> the same error.
>>
>>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
>> var dateSeq = Seq(dateStart)
>> var currentDate = dateStart
>> while (currentDate.isBefore(dateEnd)){
>>   dateSeq = dateSeq :+ currentDate
>>   currentDate = currentDate.plusDays(1)
>> }
>> return dateSeq
>>   }
>>
>> val videoAllDates = events.select("player_id", "current_ts")
>>   .filter("player_id is not null")
>>   .filter("current_ts is not null")
>>   .map(row => (row.getString(0), timestampToDate(row.getString(1))))
>>   .filter(r => r._2.isAfter(minimumDate))
>>   .reduceByKey(minDateTime)
>>   .flatMapValues(firstDate => allDates(firstDate, endDate))
>>
>>
>> And the stack trace.
>>
>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>> for shuffle 2 is 695 bytes
>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>> for shuffle 1 is 680 bytes
>> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
>> (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
>> 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
>> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
>> at Heatmap$.allDates(heatmap.scala:34)
>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at
>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> 

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Dean Wampler
Dynamic allocation doesn't work yet with Spark Streaming in any cluster
scenario. There was a previous thread on this topic which discusses the
issues that need to be resolved.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen 
wrote:

> I'm trying to get Spark Streaming to scale up/down its number of executors
> within Mesos based on workload. It's not scaling down. I'm using Spark
> 1.5.1 reading from Kafka using the direct (receiver-less) approach.
>
> Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287
> with the right configuration, I have a simple example working with the
> spark-shell connected to a Mesos cluster. By working I mean the number of
> executors scales up/down based on workload. However, the spark-shell is not
> a streaming example.
>
> What is that status of dynamic resource allocation with Spark Streaming on
> Mesos? Is it supported at all? Or supported but with some caveats to ensure
> no data loss?
>
> thanks,
> Duc
>


Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Tom Graves
Is there anything other than the spark assembly that needs to be in the
classpath?  I verified the assembly was built right and it's in the classpath
(else nothing would work).

Thanks,
Tom


 On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman 
 wrote:
   

 I think this is happening in the driver. Could you check the classpath
of the JVM that gets started ? If you use spark-submit on yarn the
classpath is setup before R gets launched, so it should match the
behavior of Scala / Python.

Thanks
Shivaram

On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves  wrote:
> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn. I've
> compiled with -Pnetlib-lgpl, see the necessary things in the spark assembly
> jar.  The nodes have  /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3,
> and /usr/lib/libgfortran.so.3.
>
>
> Running:
> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
> mdl = glm(C2~., data, family="gaussian")
>
> But I get the error:
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemLAPACK
> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefLAPACK
> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
> org.apache.spark.ml.api.r.SparkRWrappers failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>  java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>        at scala.Predef$.assert(Predef.scala:179)
>        at
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
>        at
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
>        at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
>        at
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
>        at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>        at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>        at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
>        at
> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>
> Anyone have this working?
>
> Thanks,
> Tom

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



  

Re: Start python script with SparkLauncher

2015-11-11 Thread Ted Yu
Please take a look
at launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java
to see how app.getInputStream() and app.getErrorStream() are handled.

In master branch, the Suite is located
at core/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java

FYI
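
Independently of the test suite, note that launch() just returns a plain
java.lang.Process; if its stdout/stderr are never drained, the launcher
output (including the error you are missing) is easy to lose. A rough sketch
of draining them follows; it is not taken from the suite, and the master URL
and paths are only placeholders based on the earlier message:

import java.io.{BufferedReader, InputStreamReader}

import org.apache.spark.launcher.SparkLauncher

val app = new SparkLauncher()
  .setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")
  .setAppResource("/home/user/MyCode/forSpark/wordcount.py")
  .setMaster("spark://myServerName:7077")   // placeholder master URL
  .setAppName("pytho2word")
  .launch()

// Drain stdout/stderr in daemon threads so the child cannot block on a full
// pipe and so any error message actually becomes visible.
def drain(stream: java.io.InputStream, prefix: String): Unit = {
  val t = new Thread(new Runnable {
    def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(stream))
      var line = reader.readLine()
      while (line != null) {
        println(s"$prefix $line")
        line = reader.readLine()
      }
    }
  })
  t.setDaemon(true)
  t.start()
}

drain(app.getInputStream, "[stdout]")
drain(app.getErrorStream, "[stderr]")
app.waitFor()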

On Wed, Nov 11, 2015 at 5:57 AM, Andrejs 
wrote:

> Hi all,
> I'm trying to call a python script from a scala application. Below is part
> of my code.
> My problem is that it doesn't work, but it also doesn't provide any error
> message, so I can't debug it.
>
> val spark = new SparkLauncher()
>   .setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")
>   .setAppResource("/home/user/MyCode/forSpark/wordcount.py")
>   .addPyFile("/home/andabe/MyCode/forSpark/wordcount.py")
>   .setMaster("myServerName")
>   .setAppName("pytho2word")
>   .launch()
> println("finishing")
> spark.waitFor()
> println("finished")
>
>
> Any help is appreciated.
>
> Cheers,
> Andrejs
>
>
>


RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread java8964
Any reason that Spark Cassandra connector won't work for you?
Yong

To: bryan.jeff...@gmail.com; user@spark.apache.org
From: bryan.jeff...@gmail.com
Subject: RE: Cassandra via SparkSQL/Hive JDBC
Date: Tue, 10 Nov 2015 22:42:13 -0500

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan JeffreyFrom: Bryan Jeffrey
Sent: ‎11/‎4/‎2015 11:16 AM
To: user
Subject: Cassandra via SparkSQL/Hive JDBC

Hello.
I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.  
In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.
Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?
Regards,
Bryan Jeffrey 

Start python script with SparkLauncher

2015-11-11 Thread Andrejs

Hi all,
I'm trying to call a python script from a scala application. Below is 
part of my code.
My problem is that it doesn't work, but it also doesn't provide any 
error message, so I can't debug it.


val spark = new SparkLauncher()
  .setSparkHome("/home/user/spark-1.4.1-bin-hadoop2.6")
  .setAppResource("/home/user/MyCode/forSpark/wordcount.py")
  .addPyFile("/home/andabe/MyCode/forSpark/wordcount.py")
  .setMaster("myServerName")
  .setAppName("pytho2word")
  .launch()
println("finishing")
spark.waitFor()
println("finished")


Any help is appreciated.

Cheers,
Andrejs




dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
I'm trying to get Spark Streaming to scale up/down its number of executors
within Mesos based on workload. It's not scaling down. I'm using Spark
1.5.1 reading from Kafka using the direct (receiver-less) approach.

Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287 with
the right configuration, I have a simple example working with the
spark-shell connected to a Mesos cluster. By working I mean the number of
executors scales up/down based on workload. However, the spark-shell is not
a streaming example.
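
For reference, the kind of configuration I mean is roughly the following
sketch (values are placeholders; the external shuffle service also has to be
running on every Mesos agent):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("elastic-streaming-test")          // placeholder app name
  .set("spark.mesos.coarse", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")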

What is that status of dynamic resource allocation with Spark Streaming on
Mesos? Is it supported at all? Or supported but with some caveats to ensure
no data loss?

thanks,
Duc


Re: Status of 2.11 support?

2015-11-11 Thread Jakob Odersky
Hi Sukant,

Regarding the first point: when building spark during my daily work, I
always use Scala 2.11 and have only run into build problems once. Assuming
a working build I have never had any issues with the resulting artifacts.

More generally however, I would advise you to go with Scala 2.11 under all
circumstances. Scala 2.10 has reached end-of-life and, from what I make out
of your question, you have the opportunity to switch to a newer technology,
so why stay with legacy? Furthermore, Scala 2.12 will be coming out early
next year, so I reckon that Spark will switch to Scala 2.11 by default
pretty soon*.

regards,
--Jakob

*I'm myself pretty new to the Spark community so please don't take my words
on it as gospel


On 11 November 2015 at 15:25, Ted Yu  wrote:

> For #1, the published jars are usable.
> However, you should build from source for your specific combination of
> profiles.
>
> Cheers
>
> On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale <
> sha...@cognitivescale.com> wrote:
>
>> Hi,
>>
>> My company isn't using Spark in production yet, but we are using a bit of
>> Scala.  There's a few people who have wanted to be conservative and keep
>> our
>> Scala at 2.10 in the event we start using Spark.  There are others who
>> want
>> to move to 2.11 with the idea that by the time we're using Spark it will
>> be
>> more or less 2.11-ready.
>>
>> It's hard to make a strong judgement on these kinds of things without
>> getting some community feedback.
>>
>> Looking through the internet I saw:
>>
>> 1) There's advice to build 2.11 packages from source -- but also published
>> jars to Maven Central for 2.11.  Are these jars on Maven Central usable
>> and
>> the advice to build from source outdated?
>>
>> 2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay
>> for
>> us, but is there anything else to worry about?
>>
>> It would be nice to get some answers to those questions as well as any
>> other
>> feedback from maintainers or anyone that's used Spark with Scala 2.11
>> beyond
>> simple examples.
>>
>> Thanks,
>> Sukant
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


graphx - trianglecount of 2B edges

2015-11-11 Thread Vinod Mangipudi
I was attempting to use the graphx triangle count method on a 2B edge graph
(Friendster dataset on SNAP). I have access to a 60 node cluster with 90GB
memory and 30 vcores per node.
I am running into memory issues.


 I am using 1000 partitions using the RandomVertexCut. Here’s my submit script :

spark-submit --executor-cores 5 --num-executors 100 --executor-memory 32g \
  --driver-memory 6g --conf spark.yarn.executor.memoryOverhead=8000 \
  --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" \
  trianglecount_2.10-1.0.jar

There was one particular stage where it shuffled 3.7 TB

Active Stages (1)

Stage 11: mapPartitions at VertexRDDImpl.scala:218
Submitted: 2015/11/12 01:33:06 | Duration: 7.3 min | Tasks succeeded/total: 316/344
Input: 22.6 GB | Shuffle Read: 57.0 GB | Shuffle Write: 3.7 TB
In this subsequent stage it fails reading the Shuffle around the half terabyte 
mark with a java.lang.OutOfMemoryError: Java heap space


Active Stages (1)

Stage 12: mapPartitions at GraphImpl.scala:235
Submitted: 2015/11/12 01:41:25 | Duration: 5.2 min | Tasks succeeded/total: 0/1000
Input: 26.3 GB | Shuffle Read: 533.8 GB




Compared to the benchmarking cluster (http://arxiv.org/pdf/1402.2394v1.pdf)
used on the twitter dataset (2.5B edges), the resources I am providing for the
job seem to be reasonable.
Can anyone point out any optimizations or other tweaks I need to perform to get
this to work?

Thanks!
Vinod

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma separated
list. I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")

On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:

> I know these workaround, but wouldn't it be more convenient and
> straightforward to use SparkContext#textFiles ?
>
> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
> wrote:
>
>> For more than a small number of files, you'd be better off using
>> SparkContext#union instead of RDD#union.  That will avoid building up a
>> lengthy lineage.
>>
>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>> wrote:
>>
>>> Hey Jeff,
>>> Do you mean reading from multiple text files? In that case, as a
>>> workaround, you can use the RDD#union() (or ++) method to concatenate
>>> multiple rdds. For example:
>>>
>>> val lines1 = sc.textFile("file1")
>>> val lines2 = sc.textFile("file2")
>>>
>>> val rdd = lines1 union lines2
>>>
>>> regards,
>>> --Jakob
>>>
>>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>>
 Although user can use the hdfs glob syntax to support multiple inputs.
 But sometimes, it is not convenient to do that. Not sure why there's no api
 of SparkContext#textFiles. It should be easy to implement that. I'd love to
 create a ticket and contribute for that if there's no other consideration
 that I don't know.

 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


RE: hdfs-ha on mesos - odd bug

2015-11-11 Thread Buttler, David

I have verified that this error exists on my system as well, and the suggested 
workaround also works.
Spark version: 1.5.1; 1.5.2
Mesos version: 0.21.1
CDH version: 4.7

I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the 
correct place, and I have also linked in the hdfs-site.xml file to 
$SPARK_HOME/conf.  I agree that it should work, but it doesn't.

I have also tried including the correct Hadoop configuration files in the 
application jar.

Note: it works fine from spark-shell, but it doesn't work from spark-submit

Dave

-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com] 
Sent: Tuesday, September 15, 2015 7:47 PM
To: Adrian Bridgett
Cc: user
Subject: Re: hdfs-ha on mesos - odd bug

On Mon, Sep 14, 2015 at 6:55 AM, Adrian Bridgett  wrote:
> 15/09/14 13:00:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 
> 0,
> 10.1.200.245): java.lang.IllegalArgumentException:
> java.net.UnknownHostException: nameservice1
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxie
> s.java:310)

This looks like you're trying to connect to an HA HDFS service but you have not 
provided the proper hdfs-site.xml for your app; then, instead of recognizing 
"nameservice1" as an HA nameservice, it thinks it's an actual NN address, tries 
to connect to it, and fails.

If you provide the correct hdfs-site.xml to your app (by placing it in 
$SPARK_HOME/conf or setting HADOOP_CONF_DIR to point to the conf directory), it 
should work.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/

On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang  wrote:

> Yes, that's what I suggest. TextInputFormat support multiple inputs. So in
> spark side, we just need to provide API to for that.
>
> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota 
> wrote:
>
>> IIRC, TextInputFormat supports an input path that is a comma separated
>> list. I haven't tried this, but I think you should just be able to do
>> sc.textFile("file1,file2,...")
>>
>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:
>>
>>> I know these workaround, but wouldn't it be more convenient and
>>> straightforward to use SparkContext#textFiles ?
>>>
>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
>>> wrote:
>>>
 For more than a small number of files, you'd be better off using
 SparkContext#union instead of RDD#union.  That will avoid building up a
 lengthy lineage.

 On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
 wrote:

> Hey Jeff,
> Do you mean reading from multiple text files? In that case, as a
> workaround, you can use the RDD#union() (or ++) method to concatenate
> multiple rdds. For example:
>
> val lines1 = sc.textFile("file1")
> val lines2 = sc.textFile("file2")
>
> val rdd = lines1 union lines2
>
> regards,
> --Jakob
>
> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>
>> Although user can use the hdfs glob syntax to support multiple
>> inputs. But sometimes, it is not convenient to do that. Not sure why
>> there's no api of SparkContext#textFiles. It should be easy to implement
>> that. I'd love to create a ticket and contribute for that if there's no
>> other consideration that I don't know.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>

>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: how to run unit test for specific component only

2015-11-11 Thread Ted Yu
Have you tried the following ?

build/sbt "sql/test-only *"

Cheers

On Wed, Nov 11, 2015 at 7:13 PM, weoccc  wrote:

> Hi,
>
> I am wondering how to run unit test for specific spark component only.
>
> mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none
>
> The above command doesn't seem to work. I'm using spark 1.5.
>
> Thanks,
>
> Weide
>
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
I know these workaround, but wouldn't it be more convenient and
straightforward to use SparkContext#textFiles ?

On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
wrote:

> For more than a small number of files, you'd be better off using
> SparkContext#union instead of RDD#union.  That will avoid building up a
> lengthy lineage.
>
> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
> wrote:
>
>> Hey Jeff,
>> Do you mean reading from multiple text files? In that case, as a
>> workaround, you can use the RDD#union() (or ++) method to concatenate
>> multiple rdds. For example:
>>
>> val lines1 = sc.textFile("file1")
>> val lines2 = sc.textFile("file2")
>>
>> val rdd = lines1 union lines2
>>
>> regards,
>> --Jakob
>>
>> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>>
>>> Although user can use the hdfs glob syntax to support multiple inputs.
>>> But sometimes, it is not convenient to do that. Not sure why there's no api
>>> of SparkContext#textFiles. It should be easy to implement that. I'd love to
>>> create a ticket and contribute for that if there's no other consideration
>>> that I don't know.
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>


-- 
Best Regards

Jeff Zhang


how to run unit test for specific component only

2015-11-11 Thread weoccc
Hi,

I am wondering how to run unit test for specific spark component only.

mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none

The above command doesn't seem to work. I'm using spark 1.5.

Thanks,

Weide


Re: Status of 2.11 support?

2015-11-11 Thread Ted Yu
I started playing with Scala 2.12.0-M3 but the compilation didn't pass (as
expected)

Planning to get back to 2.12 once it is released.

FYI

On Wed, Nov 11, 2015 at 4:34 PM, Jakob Odersky  wrote:

> Hi Sukant,
>
> Regarding the first point: when building spark during my daily work, I
> always use Scala 2.11 and have only run into build problems once. Assuming
> a working build I have never had any issues with the resulting artifacts.
>
> More generally however, I would advise you to go with Scala 2.11 under all
> circumstances. Scala 2.10 has reached end-of-life and, from what I make out
> of your question, you have the opportunity to switch to a newer technology,
> so why stay with legacy? Furthermore, Scala 2.12 will be coming out early
> next year, so I reckon that Spark will switch to Scala 2.11 by default
> pretty soon*.
>
> regards,
> --Jakob
>
> *I'm myself pretty new to the Spark community so please don't take my
> words on it as gospel
>
>
> On 11 November 2015 at 15:25, Ted Yu  wrote:
>
>> For #1, the published jars are usable.
>> However, you should build from source for your specific combination of
>> profiles.
>>
>> Cheers
>>
>> On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale <
>> sha...@cognitivescale.com> wrote:
>>
>>> Hi,
>>>
>>> My company isn't using Spark in production yet, but we are using a bit of
>>> Scala.  There's a few people who have wanted to be conservative and keep
>>> our
>>> Scala at 2.10 in the event we start using Spark.  There are others who
>>> want
>>> to move to 2.11 with the idea that by the time we're using Spark it will
>>> be
>>> more or less 2.11-ready.
>>>
>>> It's hard to make a strong judgement on these kinds of things without
>>> getting some community feedback.
>>>
>>> Looking through the internet I saw:
>>>
>>> 1) There's advice to build 2.11 packages from source -- but also
>>> published
>>> jars to Maven Central for 2.11.  Are these jars on Maven Central usable
>>> and
>>> the advice to build from source outdated?
>>>
>>> 2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay
>>> for
>>> us, but is there anything else to worry about?
>>>
>>> It would be nice to get some answers to those questions as well as any
>>> other
>>> feedback from maintainers or anyone that's used Spark with Scala 2.11
>>> beyond
>>> simple examples.
>>>
>>> Thanks,
>>> Sukant
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
Short answer: yes.

The Spark Cassandra Connector supports the data source API. So you can create a 
DataFrame that points directly to a Cassandra table. You can query it using the 
DataFrame API or the SQL/HiveQL interface.
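
A rough sketch of what that looks like (it assumes the spark-cassandra-connector
is on the classpath; the keyspace, table, and query are made-up names):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Point a DataFrame directly at a Cassandra table through the connector's
// data source.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// Register it so it can also be queried through SQL / HiveQL.
df.registerTempTable("my_table")
sqlContext.sql("SELECT * FROM my_table WHERE user_id = 42").show()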

If you want to see an example,  see slide# 27 and 28 in this deck that I 
presented at the Cassandra Summit 2015:
http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark


Mohammed

From: Bryan [mailto:bryan.jeff...@gmail.com]
Sent: Tuesday, November 10, 2015 7:42 PM
To: Bryan Jeffrey; user
Subject: RE: Cassandra via SparkSQL/Hive JDBC

Anyone have thoughts or a similar use-case for SparkSQL / Cassandra?

Regards,

Bryan Jeffrey

From: Bryan Jeffrey
Sent: ‎11/‎4/‎2015 11:16 AM
To: user
Subject: Cassandra via SparkSQL/Hive JDBC
Hello.

I have been working to add SparkSQL HDFS support to our application.  We're 
able to process streaming data, append to a persistent Hive table, and have 
that table available via JDBC/ODBC.  Now we're looking to access data in 
Cassandra via SparkSQL.

In reading a number of previous posts, it appears that the way to do this is to 
instantiate a Spark Context, read the data into an RDD using the Cassandra 
Spark Connector, convert the data to a DF and register it as a temporary table. 
 The data will then be accessible via SparkSQL - although I assume that you 
would need to refresh the table on a periodic basis.

Is there a more straightforward way to do this?  Is it possible to register the 
Cassandra table with Hive so that the SparkSQL thrift server instance can just 
read data directly?

Regards,

Bryan Jeffrey


Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model
loader/writer code into another spark-ml-common jar without any spark
platform dependencies so users can load the models trained by Spark ML in
their application and run the prediction?


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando  wrote:

> As of now, we basically serialize the ML model and then deserialize it for
> prediction at real time.
>
> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase  wrote:
>
>> I don’t think this answers your question but here’s how you would
>> evaluate the model in realtime in a streaming app
>>
>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>>
>> Maybe you can find a way to extract portions of MLLib and run them
>> outside of spark – loading the precomputed model and calling .predict on it…
>>
>> -adrian
>>
>> From: Andy Davidson
>> Date: Tuesday, November 10, 2015 at 11:31 PM
>> To: "user @spark"
>> Subject: thought experiment: use spark ML to real time prediction
>>
>> Let's say I have used spark ML to train a linear model. I know I can save
>> and load the model to disk. I am not sure how I can use the model in a real
>> time environment. For example I do not think I can return a “prediction” to
>> the client using spark streaming easily. Also for some applications the
>> extra latency created by the batch process might not be acceptable.
>>
>> If I was not using spark I would re-implement the model I trained in my
>> batch environment in a lang like Java  and implement a rest service that
>> uses the model to create a prediction and return the prediction to the
>> client. Many models make predictions using linear algebra. Implementing
>> predictions is relatively easy if you have a good vectorized LA package. Is
>> there a way to use a model I trained using spark ML outside of spark?
>>
>> As a motivating example, even if it's possible to return data to the
>> client using spark streaming. I think the mini batch latency would not be
>> acceptable for a high frequency stock trading system.
>>
>> Kind regards
>>
>> Andy
>>
>> P.s. The examples I have seen so far use spark streaming to “preprocess”
>> predictions. For example a recommender system might use what current users
>> are watching to calculate “trending recommendations”. These are stored on
>> disk and served up to users when they use the “movie guide”. If a
>> recommendation was a couple of min. old it would not affect the end users
>> experience.
>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using
SparkContext#union instead of RDD#union.  That will avoid building up a
lengthy lineage.
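
For example (file names are placeholders):

val paths = Seq("file1", "file2", "file3")
val rdds = paths.map(sc.textFile(_))

// One union node over all inputs instead of a chain of binary unions.
val all = sc.union(rdds)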

On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky  wrote:

> Hey Jeff,
> Do you mean reading from multiple text files? In that case, as a
> workaround, you can use the RDD#union() (or ++) method to concatenate
> multiple rdds. For example:
>
> val lines1 = sc.textFile("file1")
> val lines2 = sc.textFile("file2")
>
> val rdd = lines1 union lines2
>
> regards,
> --Jakob
>
> On 11 November 2015 at 01:20, Jeff Zhang  wrote:
>
>> Although user can use the hdfs glob syntax to support multiple inputs.
>> But sometimes, it is not convenient to do that. Not sure why there's no api
>> of SparkContext#textFiles. It should be easy to implement that. I'd love to
>> create a ticket and contribute for that if there's no other consideration
>> that I don't know.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


Re: Spark Packages Configuration Not Found

2015-11-11 Thread Jakob Odersky
As another, general question, are spark packages the go-to way of extending
spark functionality? In my specific use-case I would like to start spark
(be it spark-shell or other) and hook into the listener API.
Since I wasn't able to find much documentation about spark packages, I was
wondering if they are still actively being developed?

thanks,
--Jakob

On 10 November 2015 at 14:58, Jakob Odersky  wrote:

> (accidental keyboard-shortcut sent the message)
> ... spark-shell from the spark 1.5.2 binary distribution.
> Also, running "spPublishLocal" has the same effect.
>
> thanks,
> --Jakob
>
> On 10 November 2015 at 14:55, Jakob Odersky  wrote:
>
>> Hi,
>> I ran into in error trying to run spark-shell with an external package
>> that I built and published locally
>> using the spark-package sbt plugin (
>> https://github.com/databricks/sbt-spark-package).
>>
>> To my understanding, spark packages can be published simply as maven
>> artifacts, yet after running "publishLocal" in my package project (
>> https://github.com/jodersky/spark-paperui), the following command
>>
>> spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>>
>> gives an error:
>>
>> ::
>>
>> ::  UNRESOLVED DEPENDENCIES ::
>>
>> ::
>>
>> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default
>>
>> ::
>>
>>
>> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>> Exception in thread "main" java.lang.RuntimeException: [unresolved
>> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>> required from org.apache.spark#spark-submit-parent;1.0 default]
>> at
>> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
>> at
>> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>>
>> Do I need to include some default configuration? If so where and how
>> should I do it? All other packages I looked at had no such thing.
>>
>> Btw, I am using spark-shell from a
>>
>>
>


Spark cluster with Java 8 using ./spark-ec2

2015-11-11 Thread Philipp Grulich
Hey,

I just saw this post:
http://qnalist.com/questions/5627042/spark-cluster-with-java-8-using-spark-ec2
and I have the same question.
How can I use Java 8 with the ./spark-ec2 script?
Does anybody have a solution?

Philipp


Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Michael V Le

Hi Ted,

Thanks for reply.

I tried your patch but am having the same problem.

I ran:

./bin/pyspark --master yarn-client

>> sc.stop()
>> sc = SparkContext()

Same error dump as below.

Do I need to pass something to the new sparkcontext ?

Thanks,
Mike



From:   Ted Yu 
To: Michael V Le/Watson/IBM@IBMUS
Cc: user 
Date:   11/11/2015 01:55 PM
Subject:Re: Creating new Spark context when running in Secure YARN
fails



Looks like the delegation token should be renewed.

Mind trying the following ?

Thanks

diff --git a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
index 20771f6..e3c4a5a 100644
--- a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
+++ b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
@@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
     logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
     val args = new ClientArguments(argsArrayBuf.toArray, conf)
     totalExpectedExecutors = args.numExecutors
+    // SPARK-8851: In yarn-client mode, the AM still does the credentials refresh. The driver
+    // reads the credentials from HDFS, just like the executors and updates its own credentials
+    // cache.
+    if (conf.contains("spark.yarn.credentials.file")) {
+      YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
+    }
     client = new Client(args, conf)
     appId = client.submitApplication()

@@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(

     waitForApplication()

-    // SPARK-8851: In yarn-client mode, the AM still does the credentials refresh. The driver
-    // reads the credentials from HDFS, just like the executors and updates its own credentials
-    // cache.
-    if (conf.contains("spark.yarn.credentials.file")) {
-      YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
-    }
     monitorThread = asyncMonitorApplication()
     monitorThread.start()
   }

On Wed, Nov 11, 2015 at 10:23 AM, mvle  wrote:
  Hi,

  I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and am
  trying
  to run the pyspark shell using Spark 1.5.1

  pyspark shell works and I can run a sample code to calculate PI just
  fine.
  However, when I try to stop the current context (e.g., sc.stop()) and
  then
  create a new context (sc = SparkContext()), I get the error below.

  I have also seen errors such as: "token (HDFS_DELEGATION_TOKEN token 42
  for
  hadoop) can't be found in cache",

  Does anyone know if it is possible to stop and create a new Spark context
  within a single JVM process (driver) and have that work when dealing with
  delegation tokens from Secure YARN/HDFS?

  Thanks.

  15/11/11 10:19:53 INFO yarn.Client: Setting up container launch context
  for
  our AM
  15/11/11 10:19:53 INFO yarn.Client: Setting up the launch environment for
  our AM container
  15/11/11 10:19:53 INFO yarn.Client: Credentials file set to:
  credentials-37915c3e-1e90-44b9-add1-521598cea846
  15/11/11 10:19:53 INFO yarn.YarnSparkHadoopUtil: getting token for
  namenode:
  
hdfs://test6-allwkrbsec-001:9000/user/hadoop/.sparkStaging/application_1446695132208_0042

  15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing
  SparkContext.
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation
  Token
  can be issued only with kerberos or web authentication
          at
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken
  (FSNamesystem.java:6638)
          at
  org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken
(NameNodeRpcServer.java:563)
          at
  
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken
(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
          at
  org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos
  $ClientNamenodeProtocol$2.callBlockingMethod
  (ClientNamenodeProtocolProtos.java)
          at
  org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call
  (ProtobufRpcEngine.java:616)
          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
          at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:415)
          at
  org.apache.hadoop.security.UserGroupInformation.doAs
  (UserGroupInformation.java:1657)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

          at org.apache.hadoop.ipc.Client.call(Client.java:1476)
          at org.apache.hadoop.ipc.Client.call(Client.java:1407)
          at
  

Porting R code to SparkR

2015-11-11 Thread Sanjay Subramanian
Hi guys
This is possibly going to sound like a vague, stupid question, but I have a 
problem to solve and I need help. So any which way I go is only up :-) 
I have a bunch of R scripts (I am not an R expert) and we are currently 
evaluating how to translate these R scripts to SparkR data frame syntax. The 
goal is to use SparkR's parallelization.
As an example, we are using Corpus, tm_map, and DocumentTermMatrix from 
library("tm").
How do we translate these to SparkR syntax?
Any pointers would be helpful.
thanks
sanjay 
  


Re: Slow stage?

2015-11-11 Thread Jakob Odersky
Hi Simone,
I'm afraid I don't have an answer to your question. However I noticed the
DAG figures in the attachment. How did you generate these? I am myself
working on a project in which I am trying to generate visual
representations of the spark scheduler DAG. If such a tool already exists,
I would greatly appreciate any pointers.

thanks,
--Jakob

On 9 November 2015 at 13:52, Simone Franzini  wrote:

> Hi all,
>
> I have a complex Spark job that is broken up in many stages.
> I have a couple of stages that are particularly slow: each task takes
> around 6 - 7 minutes. This stage is fairly complex as you can see from the
> attached DAG. However, by construction each of the outer joins will have
> only 0 or 1 record on each side.
> It seems to me that this stage is really slow. However, the execution
> timeline shows that almost 100% of the time is spent in actual execution
> time not reading/writing to/from disk or in other overheads.
> Does this make any sense? I.e. is it just that these operations are slow
> (and notice task size in term of data seems small)?
> Is the pattern of operations in the DAG good or is it terribly suboptimal?
> If so, how could it be improved?
>
>
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Thrift doesn't start

2015-11-11 Thread Zhan Zhang
In the hive-site.xml, you can remove all configuration related to tez and give 
it a try again.
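For example (an illustrative fragment only, not your actual file): the property that usually pulls Tez in is hive.execution.engine, so setting it back to mr, or removing it along with any tez.* properties, in the hive-site.xml that Spark picks up should let the Thrift server start without Tez classes on its classpath:

```xml
<!-- hive-site.xml fragment (illustrative): "tez" here is what makes Hive's
     SessionState try to load org.apache.tez classes at startup -->
<property>
  <name>hive.execution.engine</name>
  <!-- was "tez"; use "mr" (or drop the property) in the copy Spark reads -->
  <value>mr</value>
</property>
```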

Thanks.

Zhan Zhang

On Nov 10, 2015, at 10:47 PM, DaeHyun Ryu 
> wrote:

Hi folks,

I configured Tez as the execution engine of Hive. After doing that, whenever I 
started the Spark Thrift server, it just stopped automatically.
I checked the log and saw the following messages. My Spark version is 1.4.1 and 
Tez version is 0.7.0 (IBM BigInsights 4.1).
Does anyone have any idea about this?

java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at 
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:130)
at 
org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:324)
at 
org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:122)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:974)
at 
org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:157)
at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:106)
at 
org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:64)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:991)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at 

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
Looks like the delegation token should be renewed.

Mind trying the following ?

Thanks

diff --git
a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB
index 20771f6..e3c4a5a 100644
---
a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
+++
b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
@@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
 logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
 val args = new ClientArguments(argsArrayBuf.toArray, conf)
 totalExpectedExecutors = args.numExecutors
+// SPARK-8851: In yarn-client mode, the AM still does the credentials
refresh. The driver
+// reads the credentials from HDFS, just like the executors and
updates its own credentials
+// cache.
+if (conf.contains("spark.yarn.credentials.file")) {
+  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
+}
 client = new Client(args, conf)
 appId = client.submitApplication()

@@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(

 waitForApplication()

-// SPARK-8851: In yarn-client mode, the AM still does the credentials
refresh. The driver
-// reads the credentials from HDFS, just like the executors and
updates its own credentials
-// cache.
-if (conf.contains("spark.yarn.credentials.file")) {
-  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
-}
 monitorThread = asyncMonitorApplication()
 monitorThread.start()
   }

On Wed, Nov 11, 2015 at 10:23 AM, mvle  wrote:

> Hi,
>
> I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and am
> trying
> to run the pyspark shell using Spark 1.5.1
>
> pyspark shell works and I can run a sample code to calculate PI just fine.
> However, when I try to stop the current context (e.g., sc.stop()) and then
> create a new context (sc = SparkContext()), I get the error below.
>
> I have also seen errors such as: "token (HDFS_DELEGATION_TOKEN token 42 for
> hadoop) can't be found in cache",
>
> Does anyone know if it is possible to stop and create a new Spark context
> within a single JVM process (driver) and have that work when dealing with
> delegation tokens from Secure YARN/HDFS?
>
> Thanks.
>
> 15/11/11 10:19:53 INFO yarn.Client: Setting up container launch context for
> our AM
> 15/11/11 10:19:53 INFO yarn.Client: Setting up the launch environment for
> our AM container
> 15/11/11 10:19:53 INFO yarn.Client: Credentials file set to:
> credentials-37915c3e-1e90-44b9-add1-521598cea846
> 15/11/11 10:19:53 INFO yarn.YarnSparkHadoopUtil: getting token for
> namenode:
>
> hdfs://test6-allwkrbsec-001:9000/user/hadoop/.sparkStaging/application_1446695132208_0042
> 15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing
> SparkContext.
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation
> Token
> can be issued only with kerberos or web authentication
> at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
> at
>
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
> at
>
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
> at
>
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at
>
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
>
> 

Re: Slow stage?

2015-11-11 Thread Mark Hamstra
Those are from the Application Web UI -- look for the "DAG Visualization"
and "Event Timeline" elements on Job and Stage pages.

On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky  wrote:

> Hi Simone,
> I'm afraid I don't have an answer to your question. However I noticed the
> DAG figures in the attachment. How did you generate these? I am myself
> working on a project in which I am trying to generate visual
> representations of the spark scheduler DAG. If such a tool already exists,
> I would greatly appreciate any pointers.
>
> thanks,
> --Jakob
>
> On 9 November 2015 at 13:52, Simone Franzini 
> wrote:
>
>> Hi all,
>>
>> I have a complex Spark job that is broken up in many stages.
>> I have a couple of stages that are particularly slow: each task takes
>> around 6 - 7 minutes. This stage is fairly complex as you can see from the
>> attached DAG. However, by construction each of the outer joins will have
>> only 0 or 1 record on each side.
>> It seems to me that this stage is really slow. However, the execution
>> timeline shows that almost 100% of the time is spent in actual execution
>> time not reading/writing to/from disk or in other overheads.
>> Does this make any sense? I.e. is it just that these operations are slow
>> (and notice task size in term of data seems small)?
>> Is the pattern of operations in the DAG good or is it terribly
>> suboptimal? If so, how could it be improved?
>>
>>
>> Simone Franzini, PhD
>>
>> http://www.linkedin.com/in/simonefranzini
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on
the Spark side we just need to provide an API for that.
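For reference, a minimal sketch of the workarounds mentioned in the quoted thread below; the paths are placeholders:

```scala
// Workaround 1: TextInputFormat already accepts a comma-separated list of paths
val combined = sc.textFile("hdfs:///data/file1.txt,hdfs:///data/file2.txt")

// Workaround 2: load each file separately and combine with SparkContext#union,
// which avoids the long lineage built up by chaining RDD#union calls
val paths = Seq("hdfs:///data/file1.txt", "hdfs:///data/file2.txt")
val unioned = sc.union(paths.map(p => sc.textFile(p)))
```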

On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota 
wrote:

> IIRC, TextInputFormat supports an input path that is a comma separated
> list. I haven't tried this, but I think you should just be able to do
> sc.textFile("file1,file2,...")
>
> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang  wrote:
>
>> I know these workaround, but wouldn't it be more convenient and
>> straightforward to use SparkContext#textFiles ?
>>
>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra 
>> wrote:
>>
>>> For more than a small number of files, you'd be better off using
>>> SparkContext#union instead of RDD#union.  That will avoid building up a
>>> lengthy lineage.
>>>
>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky 
>>> wrote:
>>>
 Hey Jeff,
 Do you mean reading from multiple text files? In that case, as a
 workaround, you can use the RDD#union() (or ++) method to concatenate
 multiple rdds. For example:

 val lines1 = sc.textFile("file1")
 val lines2 = sc.textFile("file2")

 val rdd = lines1 union lines2

 regards,
 --Jakob

 On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although user can use the hdfs glob syntax to support multiple inputs.
> But sometimes, it is not convenient to do that. Not sure why there's no 
> api
> of SparkContext#textFiles. It should be easy to implement that. I'd love 
> to
> create a ticket and contribute for that if there's no other consideration
> that I don't know.
>
> --
> Best Regards
>
> Jeff Zhang
>


>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Anybody hit this issue in spark shell?

2015-11-11 Thread Ted Yu
I searched the code base and confirmed that there is no class from
com.google.common.annotations being used.

However, there are classes from com.google.common in use,
e.g.

import com.google.common.io.{ByteStreams, Files}
import com.google.common.net.InetAddresses

FYI

On Tue, Nov 10, 2015 at 11:22 AM, Shixiong Zhu  wrote:

> Scala compiler stores some metadata in the ScalaSig attribute. See the
> following link as an example:
>
>
> http://stackoverflow.com/questions/10130106/how-does-scala-know-the-difference-between-def-foo-and-def-foo/10130403#10130403
>
> As maven-shade-plugin doesn't recognize ScalaSig, it cannot fix the
> reference in it. Not sure if there is a Scala version of
> `maven-shade-plugin` to deal with it.
>
> Generally, annotations that will be shaded should not be used in the Scala
> codes. I'm wondering if we can expose this issue in the PR build. Because
> SBT build doesn't do the shading, now it's hard for us to find similar
> issues in the PR build.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-11-09 18:47 GMT-08:00 Ted Yu :
>
>> Created https://github.com/apache/spark/pull/9585
>>
>> Cheers
>>
>> On Mon, Nov 9, 2015 at 6:39 PM, Josh Rosen 
>> wrote:
>>
>>> When we remove this, we should add a style-checker rule to ban the
>>> import so that it doesn't get added back by accident.
>>>
>>> On Mon, Nov 9, 2015 at 6:13 PM, Michael Armbrust >> > wrote:
>>>
 Yeah, we should probably remove that.

 On Mon, Nov 9, 2015 at 5:54 PM, Ted Yu  wrote:

> If there is no option to let shell skip processing @VisibleForTesting
> , should the annotation be dropped ?
>
> Cheers
>
> On Mon, Nov 9, 2015 at 5:50 PM, Marcelo Vanzin 
> wrote:
>
>> We've had this in the past when using "@VisibleForTesting" in classes
>> that for some reason the shell tries to process. QueryExecution.scala
>> seems to use that annotation and that was added recently, so that's
>> probably the issue.
>>
>> BTW, if anyone knows how Scala can find a reference to the original
>> Guava class even after shading, I'd really like to know. I've looked
>> several times and never found where the original class name is stored.
>>
>> On Mon, Nov 9, 2015 at 10:37 AM, Zhan Zhang 
>> wrote:
>> > Hi Folks,
>> >
>> > Does anybody meet the following issue? I use "mvn package -Phive
>> > -DskipTests” to build the package.
>> >
>> > Thanks.
>> >
>> > Zhan Zhang
>> >
>> >
>> >
>> > bin/spark-shell
>> > ...
>> > Spark context available as sc.
>> > error: error while loading QueryExecution, Missing dependency 'bad
>> symbolic
>> > reference. A signature in QueryExecution.class refers to term
>> annotations
>> > in package com.google.common which is not available.
>> > It may be completely missing from the current classpath, or the
>> version on
>> > the classpath might be incompatible with the version used when
>> compiling
>> > QueryExecution.class.', required by
>> >
>> /Users/zzhang/repo/spark/assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.2.0.jar(org/apache/spark/sql/execution/QueryExecution.class)
>> > <console>:10: error: not found: value sqlContext
>> >import sqlContext.implicits._
>> >   ^
>> > <console>:10: error: not found: value sqlContext
>> >import sqlContext.sql
>> >   ^
>>
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
I think for receiver-less streaming connectors like the direct Kafka input
stream or the HDFS connector, dynamic allocation can work, in contrast to
receiver-based streaming connectors, since with receiver-less connectors the
behavior of a streaming app is more like that of a normal Spark app. However,
you have to tune the backlog time, the executor idle time, and the batch
duration to make dynamic allocation work well in a streaming app.
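For concreteness, those knobs correspond to the spark.dynamicAllocation.* settings; a minimal sketch with purely illustrative values (the right numbers depend on your batch duration and load pattern):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")                  // external shuffle service is required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5s")  // "backlog time": how long tasks may queue before scaling up
  .set("spark.dynamicAllocation.executorIdleTimeout", "120s")    // "executor idle time": keep executors long enough to span quiet batches
```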

Thanks
Jerry


On Thu, Nov 12, 2015 at 5:36 AM, PhuDuc Nguyen 
wrote:

> Awesome, thanks for the tip!
>
>
>
> On Wed, Nov 11, 2015 at 2:25 PM, Tathagata Das 
> wrote:
>
>> The reason the existing dynamic allocation does not work out of the box
>> for spark streaming is because the heuristics used for decided when to
>> scale up/down is not the right one for micro-batch workloads. It works
>> great for typical batch workloads. However you can use the underlying
>> developer API to add / remove executors to implement your own scaling
>> logic.
>>
>> 1. Use SparkContext.requestExecutor and SparkContext.killExecutor
>>
>> 2. Use StreamingListener to get the scheduling delay and processing
>> times, and use that do a request or kill executors.
>>
>> TD
>>
>> On Wed, Nov 11, 2015 at 9:48 AM, PhuDuc Nguyen 
>> wrote:
>>
>>> Dean,
>>>
>>> Thanks for the reply. I'm searching (via spark mailing list archive and
>>> google) and can't find the previous thread you mentioned. I've stumbled
>>> upon a few but may not be the thread you're referring to. I'm very
>>> interested in reading that discussion and any links/keywords would be
>>> greatly appreciated.
>>>
>>> I can see it's a non-trivial problem to solve for every use case in
>>> streaming and thus not yet supported in general. However, I think (maybe
>>> naively) it can be solved for specific use cases. If I use the available
>>> features to create a fault tolerant design - i.e. failures/dead nodes can
>>> occur on master nodes, driver node, or executor nodes without data loss and
>>> "at-least-once" semantics is acceptable - then can't I safely scale down in
>>> streaming by killing executors? If this is not true, then are we saying
>>> that streaming is not fault tolerant?
>>>
>>> I know it won't be as simple as setting a config like
>>> spark.dyanmicAllocation.enabled=true and magically we'll have elastic
>>> streaming, but I'm interested if anyone else has attempted to solve this
>>> for their specific use case with extra coding involved? Pitfalls? Thoughts?
>>>
>>> thanks,
>>> Duc
>>>
>>>
>>>
>>>
>>> On Wed, Nov 11, 2015 at 8:36 AM, Dean Wampler 
>>> wrote:
>>>
 Dynamic allocation doesn't work yet with Spark Streaming in any cluster
 scenario. There was a previous thread on this topic which discusses the
 issues that need to be resolved.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
  (O'Reilly)
 Typesafe 
 @deanwampler 
 http://polyglotprogramming.com

 On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen 
 wrote:

> I'm trying to get Spark Streaming to scale up/down its number of
> executors within Mesos based on workload. It's not scaling down. I'm using
> Spark 1.5.1 reading from Kafka using the direct (receiver-less) approach.
>
> Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287
> with the right configuration, I have a simple example working with the
> spark-shell connected to a Mesos cluster. By working I mean the number of
> executors scales up/down based on workload. However, the spark-shell is 
> not
> a streaming example.
>
> What is that status of dynamic resource allocation with Spark
> Streaming on Mesos? Is it supported at all? Or supported but with some
> caveats to ensure no data loss?
>
> thanks,
> Duc
>


>>>
>>
>


Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Saisai Shao
Yeah, agreed. Only for some extreme streaming workloads whose pattern happens
to fit dynamic allocation could it work very well. In normal cases, no
executor will remain idle for a long time, so frequently scaling executors up
and down will bring large overhead and latency to the streaming app.

On Thu, Nov 12, 2015 at 10:43 AM, Tathagata Das  wrote:

> For receivers, you must enable write ahead logs to make sure you don't
> lose data.
> And tuning the backlog time, executor idle time, etc. would not work very
> well for streaming, as the micro-batch jobs are likely to use all the
> executors all the time, and no executor will remain idle for long. That is
> why the heuristic doesn't work that well.
>
>
> On Wed, Nov 11, 2015 at 6:32 PM, Saisai Shao 
> wrote:
>
>> I think for receiver-less Streaming connectors like direct Kafka input
>> stream or hdfs connector, dynamic allocation could be worked compared to
>> other receiver-based streaming connectors, since for receiver-less
>> connectors, the behavior of streaming app is more like a normal Spark app,
>> so dynamic allocation could be worked well, but you have to tune the
>> backlog time, executor idle time as well as batch duration to make dynamic
>> allocation work well in streaming app.
>>
>> Thanks
>> Jerry
>>
>>
>> On Thu, Nov 12, 2015 at 5:36 AM, PhuDuc Nguyen 
>> wrote:
>>
>>> Awesome, thanks for the tip!
>>>
>>>
>>>
>>> On Wed, Nov 11, 2015 at 2:25 PM, Tathagata Das 
>>> wrote:
>>>
 The reason the existing dynamic allocation does not work out of the box
 for spark streaming is because the heuristics used for decided when to
 scale up/down is not the right one for micro-batch workloads. It works
 great for typical batch workloads. However you can use the underlying
 developer API to add / remove executors to implement your own scaling
 logic.

 1. Use SparkContext.requestExecutor and SparkContext.killExecutor

 2. Use StreamingListener to get the scheduling delay and processing
 times, and use that do a request or kill executors.

 TD

 On Wed, Nov 11, 2015 at 9:48 AM, PhuDuc Nguyen 
 wrote:

> Dean,
>
> Thanks for the reply. I'm searching (via spark mailing list archive
> and google) and can't find the previous thread you mentioned. I've 
> stumbled
> upon a few but may not be the thread you're referring to. I'm very
> interested in reading that discussion and any links/keywords would be
> greatly appreciated.
>
> I can see it's a non-trivial problem to solve for every use case in
> streaming and thus not yet supported in general. However, I think (maybe
> naively) it can be solved for specific use cases. If I use the available
> features to create a fault tolerant design - i.e. failures/dead nodes can
> occur on master nodes, driver node, or executor nodes without data loss 
> and
> "at-least-once" semantics is acceptable - then can't I safely scale down 
> in
> streaming by killing executors? If this is not true, then are we saying
> that streaming is not fault tolerant?
>
> I know it won't be as simple as setting a config like
> spark.dyanmicAllocation.enabled=true and magically we'll have elastic
> streaming, but I'm interested if anyone else has attempted to solve this
> for their specific use case with extra coding involved? Pitfalls? 
> Thoughts?
>
> thanks,
> Duc
>
>
>
>
> On Wed, Nov 11, 2015 at 8:36 AM, Dean Wampler 
> wrote:
>
>> Dynamic allocation doesn't work yet with Spark Streaming in any
>> cluster scenario. There was a previous thread on this topic which 
>> discusses
>> the issues that need to be resolved.
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen <
>> duc.was.h...@gmail.com> wrote:
>>
>>> I'm trying to get Spark Streaming to scale up/down its number of
>>> executors within Mesos based on workload. It's not scaling down. I'm 
>>> using
>>> Spark 1.5.1 reading from Kafka using the direct (receiver-less) 
>>> approach.
>>>
>>> Based on this ticket
>>> https://issues.apache.org/jira/browse/SPARK-6287 with the right
>>> configuration, I have a simple example working with the spark-shell
>>> connected to a Mesos cluster. By working I mean the number of executors
>>> scales up/down based on workload. However, the spark-shell is not a
>>> streaming example.
>>>
>>> What is that status of dynamic resource allocation 

Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
I assume your config contains "spark.yarn.credentials.file" -
otherwise startExecutorDelegationTokenRenewer(conf) call would be skipped.

On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le  wrote:

> Hi Ted,
>
> Thanks for reply.
>
> I tried your patch but am having the same problem.
>
> I ran:
>
> ./bin/pyspark --master yarn-client
>
> >> sc.stop()
> >> sc = SparkContext()
>
> Same error dump as below.
>
> Do I need to pass something to the new sparkcontext ?
>
> Thanks,
> Mike
>
>
> From: Ted Yu 
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user 
> Date: 11/11/2015 01:55 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> --
>
>
>
> Looks like the delegation token should be renewed.
>
> Mind trying the following ?
>
> Thanks
>
> diff --git
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB
> index 20771f6..e3c4a5a 100644
> ---
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> +++
> b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> @@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
>  logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
>  val args = new ClientArguments(argsArrayBuf.toArray, conf)
>  totalExpectedExecutors = args.numExecutors
> +// SPARK-8851: In yarn-client mode, the AM still does the credentials
> refresh. The driver
> +// reads the credentials from HDFS, just like the executors and
> updates its own credentials
> +// cache.
> +if (conf.contains("spark.yarn.credentials.file")) {
> +  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
> +}
>  client = new Client(args, conf)
>  appId = client.submitApplication()
>
> @@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(
>
>  waitForApplication()
>
> -// SPARK-8851: In yarn-client mode, the AM still does the credentials
> refresh. The driver
> -// reads the credentials from HDFS, just like the executors and
> updates its own credentials
> -// cache.
> -if (conf.contains("spark.yarn.credentials.file")) {
> -  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
> -}
>  monitorThread = asyncMonitorApplication()
>  monitorThread.start()
>}
>
> On Wed, Nov 11, 2015 at 10:23 AM, mvle <*m...@us.ibm.com*
> > wrote:
>
>Hi,
>
>I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and am
>trying
>to run the pyspark shell using Spark 1.5.1
>
>pyspark shell works and I can run a sample code to calculate PI just
>fine.
>However, when I try to stop the current context (e.g., sc.stop()) and
>then
>create a new context (sc = SparkContext()), I get the error below.
>
>I have also seen errors such as: "token (HDFS_DELEGATION_TOKEN token
>42 for
>hadoop) can't be found in cache",
>
>Does anyone know if it is possible to stop and create a new Spark
>context
>within a single JVM process (driver) and have that work when dealing
>with
>delegation tokens from Secure YARN/HDFS?
>
>Thanks.
>
>15/11/11 10:19:53 INFO yarn.Client: Setting up container launch
>context for
>our AM
>15/11/11 10:19:53 INFO yarn.Client: Setting up the launch environment
>for
>our AM container
>15/11/11 10:19:53 INFO yarn.Client: Credentials file set to:
>credentials-37915c3e-1e90-44b9-add1-521598cea846
>15/11/11 10:19:53 INFO yarn.YarnSparkHadoopUtil: getting token for
>namenode:
>
>
> hdfs://test6-allwkrbsec-001:9000/user/hadoop/.sparkStaging/application_1446695132208_0042
>15/11/11 10:19:53 ERROR spark.SparkContext: Error initializing
>SparkContext.
>org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation
>Token
>can be issued only with kerberos or web authentication
>at
>
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6638)
>at
>
>
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:563)
>at
>
>
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:987)
>at
>
>
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>at
>
>
> 

Re: Slow stage?

2015-11-11 Thread Koert Kuipers
I am a person that usually hates UIs, and I have to say I love these. Very
useful.

On Wed, Nov 11, 2015 at 3:23 PM, Mark Hamstra 
wrote:

> Those are from the Application Web UI -- look for the "DAG Visualization"
> and "Event Timeline" elements on Job and Stage pages.
>
> On Wed, Nov 11, 2015 at 10:58 AM, Jakob Odersky 
> wrote:
>
>> Hi Simone,
>> I'm afraid I don't have an answer to your question. However I noticed the
>> DAG figures in the attachment. How did you generate these? I am myself
>> working on a project in which I am trying to generate visual
>> representations of the spark scheduler DAG. If such a tool already exists,
>> I would greatly appreciate any pointers.
>>
>> thanks,
>> --Jakob
>>
>> On 9 November 2015 at 13:52, Simone Franzini 
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a complex Spark job that is broken up in many stages.
>>> I have a couple of stages that are particularly slow: each task takes
>>> around 6 - 7 minutes. This stage is fairly complex as you can see from the
>>> attached DAG. However, by construction each of the outer joins will have
>>> only 0 or 1 record on each side.
>>> It seems to me that this stage is really slow. However, the execution
>>> timeline shows that almost 100% of the time is spent in actual execution
>>> time not reading/writing to/from disk or in other overheads.
>>> Does this make any sense? I.e. is it just that these operations are slow
>>> (and notice task size in term of data seems small)?
>>> Is the pattern of operations in the DAG good or is it terribly
>>> suboptimal? If so, how could it be improved?
>>>
>>>
>>> Simone Franzini, PhD
>>>
>>> http://www.linkedin.com/in/simonefranzini
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>


Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Tathagata Das
The reason the existing dynamic allocation does not work out of the box for
Spark Streaming is that the heuristic used for deciding when to scale
up/down is not the right one for micro-batch workloads. It works great for
typical batch workloads. However, you can use the underlying developer API
to add / remove executors to implement your own scaling logic.

1. Use SparkContext.requestExecutors and SparkContext.killExecutors

2. Use StreamingListener to get the scheduling delay and processing times,
and use that to request or kill executors.
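A minimal sketch of that approach (Spark 1.5-era APIs; the threshold and the scale-up-only policy are purely illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Naive scale-up-only policy: ask for one more executor whenever a completed batch
// had to wait longer than one batch interval before it could start.
// Scaling down would call sc.killExecutors(ids) with executor IDs tracked separately
// (for example via a SparkListener's onExecutorAdded / onExecutorRemoved events).
class ScaleUpListener(sc: SparkContext, batchIntervalMs: Long) extends StreamingListener {

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val schedulingDelayMs = batchCompleted.batchInfo.schedulingDelay.getOrElse(0L)
    if (schedulingDelayMs > batchIntervalMs) {
      sc.requestExecutors(1)
    }
  }
}

// Registration, assuming an existing StreamingContext `ssc` with a 10-second batch interval:
// ssc.addStreamingListener(new ScaleUpListener(ssc.sparkContext, batchIntervalMs = 10000L))
```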

TD

On Wed, Nov 11, 2015 at 9:48 AM, PhuDuc Nguyen 
wrote:

> Dean,
>
> Thanks for the reply. I'm searching (via spark mailing list archive and
> google) and can't find the previous thread you mentioned. I've stumbled
> upon a few but may not be the thread you're referring to. I'm very
> interested in reading that discussion and any links/keywords would be
> greatly appreciated.
>
> I can see it's a non-trivial problem to solve for every use case in
> streaming and thus not yet supported in general. However, I think (maybe
> naively) it can be solved for specific use cases. If I use the available
> features to create a fault tolerant design - i.e. failures/dead nodes can
> occur on master nodes, driver node, or executor nodes without data loss and
> "at-least-once" semantics is acceptable - then can't I safely scale down in
> streaming by killing executors? If this is not true, then are we saying
> that streaming is not fault tolerant?
>
> I know it won't be as simple as setting a config like
> spark.dyanmicAllocation.enabled=true and magically we'll have elastic
> streaming, but I'm interested if anyone else has attempted to solve this
> for their specific use case with extra coding involved? Pitfalls? Thoughts?
>
> thanks,
> Duc
>
>
>
>
> On Wed, Nov 11, 2015 at 8:36 AM, Dean Wampler 
> wrote:
>
>> Dynamic allocation doesn't work yet with Spark Streaming in any cluster
>> scenario. There was a previous thread on this topic which discusses the
>> issues that need to be resolved.
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen 
>> wrote:
>>
>>> I'm trying to get Spark Streaming to scale up/down its number of
>>> executors within Mesos based on workload. It's not scaling down. I'm using
>>> Spark 1.5.1 reading from Kafka using the direct (receiver-less) approach.
>>>
>>> Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287
>>> with the right configuration, I have a simple example working with the
>>> spark-shell connected to a Mesos cluster. By working I mean the number of
>>> executors scales up/down based on workload. However, the spark-shell is not
>>> a streaming example.
>>>
>>> What is that status of dynamic resource allocation with Spark Streaming
>>> on Mesos? Is it supported at all? Or supported but with some caveats to
>>> ensure no data loss?
>>>
>>> thanks,
>>> Duc
>>>
>>
>>
>


Re: anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-11 Thread Shivaram Venkataraman
Nothing more -- the only two things I can think of are: (a) is there
something else on the classpath that comes before this LGPL JAR? I've
seen cases where two versions of netlib-java on the classpath can mess
things up. (b) There is something about the way SparkR is using
reflection to invoke the ML Pipelines code that is breaking the BLAS
library discovery. I don't know of a good way to debug this yet
though.
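
One way to check (a) at runtime is to ask netlib-java which implementations it actually bound to, from the same JVM that runs the glm call; a minimal sketch (standard netlib-java classes, assuming they made it into the assembly):

```scala
import com.github.fommil.netlib.{BLAS, LAPACK}

// Prints the BLAS/LAPACK implementations netlib-java resolved on this JVM's classpath.
// Seeing F2jBLAS / F2jLAPACK here means the native system libraries were not picked up.
println("BLAS:   " + BLAS.getInstance().getClass.getName)
println("LAPACK: " + LAPACK.getInstance().getClass.getName)
```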

Thanks
Shivaram

On Wed, Nov 11, 2015 at 5:55 AM, Tom Graves  wrote:
> Is there anything other then the spark assembly that needs to be in the
> classpath?  I verified the assembly was built right and its in the classpath
> (else nothing would work).
>
> Thanks,
> Tom
>
>
>
> On Tuesday, November 10, 2015 8:29 PM, Shivaram Venkataraman
>  wrote:
>
>
> I think this is happening in the driver. Could you check the classpath
> of the JVM that gets started ? If you use spark-submit on yarn the
> classpath is setup before R gets launched, so it should match the
> behavior of Scala / Python.
>
> Thanks
> Shivaram
>
> On Fri, Nov 6, 2015 at 1:39 PM, Tom Graves 
> wrote:
>> I'm trying to use the netlib-java stuff with mllib and sparkR on yarn.
>> I've
>> compiled with -Pnetlib-lgpl, see the necessary things in the spark
>> assembly
>> jar.  The nodes have  /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3,
>> and /usr/lib/libgfortran.so.3.
>>
>>
>> Running:
>> data <- read.df(sqlContext, 'data.csv', 'com.databricks.spark.csv')
>> mdl = glm(C2~., data, family="gaussian")
>>
>> But I get the error:
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeSystemLAPACK
>> 15/11/06 21:17:27 WARN LAPACK: Failed to load implementation from:
>> com.github.fommil.netlib.NativeRefLAPACK
>> 15/11/06 21:17:27 ERROR RBackendHandler: fitRModelFormula on
>> org.apache.spark.ml.api.r.SparkRWrappers failed
>> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>>  java.lang.AssertionError: assertion failed: lapack.dpotrs returned 18.
>>at scala.Predef$.assert(Predef.scala:179)
>>at
>>
>> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
>>at
>>
>> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:114)
>>at
>>
>> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:166)
>>at
>>
>> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:65)
>>at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>>at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>>at
>> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:138)
>>at
>> org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>>
>> Anyone have this working?
>>
>> Thanks,
>> Tom
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Different classpath across stages?

2015-11-11 Thread John Meehan
I’ve been running into a strange class not found problem, but only when my job 
has more than one phase.  I have an RDD[ProtobufClass] which behaves as 
expected in a single-stage job (e.g. serialize to JSON and export).  But when I 
try to groupByKey, the first stage runs (essentially a keyBy), but eventually 
errors with the relatively common ‘unable to find protocol buffer class’ error 
for the first task of the second stage.  I’ve tried the userClassPathFirst 
options, but then the whole job fails.  So I’m wondering if there is some kind 
of configuration I can use to help Spark resolve the right protocol buffer 
class across stage boundaries?

-John
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread PhuDuc Nguyen
Awesome, thanks for the tip!



On Wed, Nov 11, 2015 at 2:25 PM, Tathagata Das  wrote:

> The reason the existing dynamic allocation does not work out of the box
> for spark streaming is because the heuristics used for decided when to
> scale up/down is not the right one for micro-batch workloads. It works
> great for typical batch workloads. However you can use the underlying
> developer API to add / remove executors to implement your own scaling
> logic.
>
> 1. Use SparkContext.requestExecutor and SparkContext.killExecutor
>
> 2. Use StreamingListener to get the scheduling delay and processing times,
> and use that do a request or kill executors.
>
> TD
>
> On Wed, Nov 11, 2015 at 9:48 AM, PhuDuc Nguyen 
> wrote:
>
>> Dean,
>>
>> Thanks for the reply. I'm searching (via spark mailing list archive and
>> google) and can't find the previous thread you mentioned. I've stumbled
>> upon a few but may not be the thread you're referring to. I'm very
>> interested in reading that discussion and any links/keywords would be
>> greatly appreciated.
>>
>> I can see it's a non-trivial problem to solve for every use case in
>> streaming and thus not yet supported in general. However, I think (maybe
>> naively) it can be solved for specific use cases. If I use the available
>> features to create a fault tolerant design - i.e. failures/dead nodes can
>> occur on master nodes, driver node, or executor nodes without data loss and
>> "at-least-once" semantics is acceptable - then can't I safely scale down in
>> streaming by killing executors? If this is not true, then are we saying
>> that streaming is not fault tolerant?
>>
>> I know it won't be as simple as setting a config like
>> spark.dyanmicAllocation.enabled=true and magically we'll have elastic
>> streaming, but I'm interested if anyone else has attempted to solve this
>> for their specific use case with extra coding involved? Pitfalls? Thoughts?
>>
>> thanks,
>> Duc
>>
>>
>>
>>
>> On Wed, Nov 11, 2015 at 8:36 AM, Dean Wampler 
>> wrote:
>>
>>> Dynamic allocation doesn't work yet with Spark Streaming in any cluster
>>> scenario. There was a previous thread on this topic which discusses the
>>> issues that need to be resolved.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Nov 11, 2015 at 8:09 AM, PhuDuc Nguyen 
>>> wrote:
>>>
 I'm trying to get Spark Streaming to scale up/down its number of
 executors within Mesos based on workload. It's not scaling down. I'm using
 Spark 1.5.1 reading from Kafka using the direct (receiver-less) approach.

 Based on this ticket https://issues.apache.org/jira/browse/SPARK-6287
 with the right configuration, I have a simple example working with the
 spark-shell connected to a Mesos cluster. By working I mean the number of
 executors scales up/down based on workload. However, the spark-shell is not
 a streaming example.

 What is that status of dynamic resource allocation with Spark Streaming
 on Mesos? Is it supported at all? Or supported but with some caveats to
 ensure no data loss?

 thanks,
 Duc

>>>
>>>
>>
>


Re: Spark Packages Configuration Not Found

2015-11-11 Thread Burak Yavuz
Hi Jakob,
> As another, general question, are spark packages the go-to way of
extending spark functionality?

Definitely. There are ~150 Spark Packages out there on spark-packages.org.
I use a lot of them in everyday Spark work.
The number of released packages has steadily increased over the last
few months.

> Since I wasn't able to find much documentation about spark packages, I
was wondering if they are still actively being developed?

I would love to work on the documentation. Some exists on spark-packages.org,
but there could be a lot more. If you have any specific questions, feel
free to submit them to me directly, and I'll incorporate them into a FAQ I'm
working on.

Regarding your initial problem: Unfortunately `spPublishLocal` is broken
due to ivy configuration mismatches between Spark and the sbt-spark-package
plugin. What you can do instead is:
```
$ sbt +spark-paperui-server/publishM2
$ spark-shell --packages ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
```

Hopefully that should work for you.

Best,
Burak


On Wed, Nov 11, 2015 at 10:53 AM, Jakob Odersky  wrote:

> As another, general question, are spark packages the go-to way of
> extending spark functionality? In my specific use-case I would like to
> start spark (be it spark-shell or other) and hook into the listener API.
> Since I wasn't able to find much documentation about spark packages, I was
> wondering if they are still actively being developed?
>
> thanks,
> --Jakob
>
> On 10 November 2015 at 14:58, Jakob Odersky  wrote:
>
>> (accidental keyboard-shortcut sent the message)
>> ... spark-shell from the spark 1.5.2 binary distribution.
>> Also, running "spPublishLocal" has the same effect.
>>
>> thanks,
>> --Jakob
>>
>> On 10 November 2015 at 14:55, Jakob Odersky  wrote:
>>
>>> Hi,
>>> I ran into in error trying to run spark-shell with an external package
>>> that I built and published locally
>>> using the spark-package sbt plugin (
>>> https://github.com/databricks/sbt-spark-package).
>>>
>>> To my understanding, spark packages can be published simply as maven
>>> artifacts, yet after running "publishLocal" in my package project (
>>> https://github.com/jodersky/spark-paperui), the following command
>>>
>>>park-shell --packages
>>> ch.jodersky:spark-paperui-server_2.10:0.1-SNAPSHOT
>>>
>>> gives an error:
>>>
>>> ::
>>>
>>> ::  UNRESOLVED DEPENDENCIES ::
>>>
>>> ::
>>>
>>> :: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>>> required from org.apache.spark#spark-submit-parent;1.0 default
>>>
>>> ::
>>>
>>>
>>> :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>> Exception in thread "main" java.lang.RuntimeException: [unresolved
>>> dependency: ch.jodersky#spark-paperui-server_2.10;0.1: configuration not
>>> found in ch.jodersky#spark-paperui-server_2.10;0.1: 'default'. It was
>>> required from org.apache.spark#spark-submit-parent;1.0 default]
>>> at
>>> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1011)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:12
>>>
>>> Do I need to include some default configuration? If so where and how
>>> should I do it? All other packages I looked at had no such thing.
>>>
>>> Btw, I am using spark-shell from a
>>>
>>>
>>
>


Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Michael V Le

It looks like my config does not have "spark.yarn.credentials.file".

I executed:
sc._conf.getAll()

[(u'spark.ssl.keyStore', u'xxx.keystore'), (u'spark.eventLog.enabled',
u'true'), (u'spark.ssl.keyStorePassword', u'XXX'),
(u'spark.yarn.principal', u'XXX'), (u'spark.master', u'yarn-client'),
(u'spark.ssl.keyPassword', u'XXX'),
(u'spark.authenticate.sasl.serverAlwaysEncrypt', u'true'),
(u'spark.ssl.trustStorePassword', u'XXX'), (u'spark.ssl.protocol',
u'TLSv1.2'), (u'spark.authenticate.enableSaslEncryption', u'true'),
(u'spark.app.name', u'PySparkShell'), (u'spark.yarn.keytab',
u'XXX.keytab'), (u'spark.yarn.historyServer.address', u'xxx-001:18080'),
(u'spark.rdd.compress', u'True'), (u'spark.eventLog.dir',
u'hdfs://xxx-001:9000/user/hadoop/sparklogs'),
(u'spark.ssl.enabledAlgorithms',
u'TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA'),
(u'spark.serializer.objectStreamReset', u'100'),
(u'spark.history.fs.logDirectory',
u'hdfs://xxx-001:9000/user/hadoop/sparklogs'), (u'spark.yarn.isPython',
u'true'), (u'spark.submit.deployMode', u'client'), (u'spark.ssl.enabled',
u'true'), (u'spark.authenticate', u'true'), (u'spark.ssl.trustStore',
u'xxx.truststore')]

I am not really familiar with "spark.yarn.credentials.file" and had thought
it was created automatically after communicating with YARN to get tokens.

Thanks,
Mike




From:   Ted Yu 
To: Michael V Le/Watson/IBM@IBMUS
Cc: user 
Date:   11/11/2015 03:35 PM
Subject:Re: Creating new Spark context when running in Secure YARN
fails



I assume your config contains "spark.yarn.credentials.file" -
otherwise startExecutorDelegationTokenRenewer(conf) call would be skipped.

On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le  wrote:
  Hi Ted,

  Thanks for reply.

  I tried your patch but am having the same problem.

  I ran:

  ./bin/pyspark --master yarn-client

  >> sc.stop()
  >> sc = SparkContext()

  Same error dump as below.

  Do I need to pass something to the new sparkcontext ?

  Thanks,
  Mike


  From: Ted Yu 
  To: Michael V Le/Watson/IBM@IBMUS
  Cc: user 
  Date: 11/11/2015 01:55 PM
  Subject: Re: Creating new Spark context when running in Secure YARN fails




  Looks like the delegation token should be renewed.

  Mind trying the following ?

  Thanks

  diff --git
  
a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
 b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB

  index 20771f6..e3c4a5a 100644
  ---
  
a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala

  +++
  
b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala

  @@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
       logDebug("ClientArguments called with: " + argsArrayBuf.mkString("
  "))
       val args = new ClientArguments(argsArrayBuf.toArray, conf)
       totalExpectedExecutors = args.numExecutors
  +    // SPARK-8851: In yarn-client mode, the AM still does the
  credentials refresh. The driver
  +    // reads the credentials from HDFS, just like the executors and
  updates its own credentials
  +    // cache.
  +    if (conf.contains("spark.yarn.credentials.file")) {
  +      YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
  +    }
       client = new Client(args, conf)
       appId = client.submitApplication()

  @@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(

       waitForApplication()

  -    // SPARK-8851: In yarn-client mode, the AM still does the
  credentials refresh. The driver
  -    // reads the credentials from HDFS, just like the executors and
  updates its own credentials
  -    // cache.
  -    if (conf.contains("spark.yarn.credentials.file")) {
  -      YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
  -    }
       monitorThread = asyncMonitorApplication()
       monitorThread.start()
     }

  On Wed, Nov 11, 2015 at 10:23 AM, mvle  wrote:
Hi,

I've deployed a Secure YARN 2.7.1 cluster with HDFS encryption and
am trying
to run the pyspark shell using Spark 1.5.1

pyspark shell works and I can run a sample code to calculate PI
just fine.
However, when I try to stop the current context (e.g., sc.stop())
and then
create a new context (sc = SparkContext()), I get the error below.

I have also seen errors such as: "token (HDFS_DELEGATION_TOKEN
token 42 for
hadoop) can't be found in cache",

Does anyone know if it is possible to stop and create a new Spark
  

How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-11 Thread Amir Rahnama
Hey,

Does anybody know how one can sort the result in the stateful example?

Python would be preferred.

https://github.com/apache/spark/blob/859dff56eb0f8c63c86e7e900a12340c199e6247/examples/src/main/python/streaming/stateful_network_wordcount.py
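One common approach, sketched here in Scala (the PySpark DStream API exposes the same transform method and its RDDs the same sortBy), is to sort each batch of the state DStream by count; `stateCounts` is a stand-in for the (word, count) DStream that updateStateByKey returns in the example:

```scala
// `stateCounts` stands in for the DStream[(String, Int)] returned by updateStateByKey;
// sort each batch's RDD by descending count before printing it.
val sortedCounts = stateCounts.transform(rdd => rdd.sortBy(_._2, ascending = false))
sortedCounts.print()
```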
-- 
Thanks and Regards,

Amir Hossein Rahnama

*Tel: +46 (0) 761 681 102*
Website: www.ambodi.com
Twitter: @_ambodi 


Re: Creating new Spark context when running in Secure YARN fails

2015-11-11 Thread Ted Yu
Please take a look at
yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
where this config is described

Cheers

On Wed, Nov 11, 2015 at 1:45 PM, Michael V Le  wrote:

> It looks like my config does not have "spark.yarn.credentials.file".
>
> I executed:
> sc._conf.getAll()
>
> [(u'spark.ssl.keyStore', u'xxx.keystore'), (u'spark.eventLog.enabled',
> u'true'), (u'spark.ssl.keyStorePassword', u'XXX'),
> (u'spark.yarn.principal', u'XXX'), (u'spark.master', u'yarn-client'),
> (u'spark.ssl.keyPassword', u'XXX'),
> (u'spark.authenticate.sasl.serverAlwaysEncrypt', u'true'),
> (u'spark.ssl.trustStorePassword', u'XXX'), (u'spark.ssl.protocol',
> u'TLSv1.2'), (u'spark.authenticate.enableSaslEncryption', u'true'), (u'
> spark.app.name', u'PySparkShell'), (u'spark.yarn.keytab', u'XXX.keytab'),
> (u'spark.yarn.historyServer.address', u'xxx-001:18080'),
> (u'spark.rdd.compress', u'True'), (u'spark.eventLog.dir',
> u'hdfs://xxx-001:9000/user/hadoop/sparklogs'),
> (u'spark.ssl.enabledAlgorithms',
> u'TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA'),
> (u'spark.serializer.objectStreamReset', u'100'),
> (u'spark.history.fs.logDirectory',
> u'hdfs://xxx-001:9000/user/hadoop/sparklogs'), (u'spark.yarn.isPython',
> u'true'), (u'spark.submit.deployMode', u'client'), (u'spark.ssl.enabled',
> u'true'), (u'spark.authenticate', u'true'), (u'spark.ssl.trustStore',
> u'xxx.truststore')]
>
> I am not really familiar with "spark.yarn.credentials.file" and had
> thought it was created automatically after communicating with YARN to get
> tokens.
>
> Thanks,
> Mike
>
>
>
> From: Ted Yu 
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user 
> Date: 11/11/2015 03:35 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> --
>
>
>
> I assume your config contains "spark.yarn.credentials.file" - otherwise
> the startExecutorDelegationTokenRenewer(conf) call would be skipped.
>
> On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le <*m...@us.ibm.com*
> > wrote:
>
>Hi Ted,
>
>Thanks for reply.
>
>I tried your patch but am having the same problem.
>
>I ran:
>
>./bin/pyspark --master yarn-client
>
>>> sc.stop()
>>> sc = SparkContext()
>
>Same error dump as below.
>
    >Do I need to pass something to the new SparkContext?
>
>Thanks,
>Mike
>
>
>From: Ted Yu <*yuzhih...@gmail.com* >
>To: Michael V Le/Watson/IBM@IBMUS
>Cc: user <*user@spark.apache.org* >
>Date: 11/11/2015 01:55 PM
>Subject: Re: Creating new Spark context when running in Secure YARN
>fails
>--
>
>
>
>
>Looks like the delegation token should be renewed.
>
    >Mind trying the following?
>
>Thanks
>
    >diff --git a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
    >index 20771f6..e3c4a5a 100644
    >--- a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
    >+++ b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
>@@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
    > logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
    > val args = new ClientArguments(argsArrayBuf.toArray, conf)
    > totalExpectedExecutors = args.numExecutors
    >+// SPARK-8851: In yarn-client mode, the AM still does the credentials refresh. The driver
    >+// reads the credentials from HDFS, just like the executors and updates its own credentials
    >+// cache.
    >+if (conf.contains("spark.yarn.credentials.file")) {
    >+  YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf)
>+}
> client = new Client(args, conf)
> appId = client.submitApplication()
>
>@@ -63,12 +69,6 @@ private[spark] class YarnClientSchedulerBackend(
>
> waitForApplication()
>
    >-// SPARK-8851: In yarn-client mode, the AM still does the credentials refresh. The driver
    >-// reads the credentials from HDFS, just like the executors and updates its own credentials
    >-// cache.
>-if 

Upgrading Spark in EC2 clusters

2015-11-11 Thread Augustus Hong
Hey All,

I have a Spark cluster (running version 1.5.0) on EC2 launched with the
provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
cluster, what's the safest / recommended way to do that?


I know I can spin up a new cluster running 1.5.2, but it doesn't seem
efficient to spin up a new cluster every time we need to upgrade.


Thanks,
Augustus





-- 
*Augustus Hong*
 Data Analytics | Branch Metrics
 m 650-391-3369 | e augus...@branch.io


Status of 2.11 support?

2015-11-11 Thread shajra-cogscale
Hi,

My company isn't using Spark in production yet, but we are using a bit of
Scala.  There are a few people who have wanted to be conservative and keep our
Scala at 2.10 in the event we start using Spark.  There are others who want
to move to 2.11 with the idea that by the time we're using Spark it will be
more or less 2.11-ready.

It's hard to make a strong judgement on these kinds of things without
getting some community feedback.

Looking through the internet I saw:

1) There's advice to build 2.11 packages from source -- but also published
jars to Maven Central for 2.11.  Are these jars on Maven Central usable and
the advice to build from source outdated?

2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay for
us, but is there anything else to worry about?

It would be nice to get some answers to those questions as well as any other
feedback from maintainers or anyone that's used Spark with Scala 2.11 beyond
simple examples.

Thanks,
Sukant



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Status of 2.11 support?

2015-11-11 Thread Ted Yu
For #1, the published jars are usable.
However, you should build from source for your specific combination of
profiles.
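
For example, going from memory of the 1.5 build docs (so please double-check
the profiles against your Hadoop setup - the ones below are only an
illustration):

./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package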

Cheers

On Wed, Nov 11, 2015 at 3:22 PM, shajra-cogscale 
wrote:

> Hi,
>
> My company isn't using Spark in production yet, but we are using a bit of
> Scala.  There's a few people who have wanted to be conservative and keep
> our
> Scala at 2.10 in the event we start using Spark.  There are others who want
> to move to 2.11 with the idea that by the time we're using Spark it will be
> more or less 2.11-ready.
>
> It's hard to make a strong judgement on these kinds of things without
> getting some community feedback.
>
> Looking through the internet I saw:
>
> 1) There's advice to build 2.11 packages from source -- but also published
> jars to Maven Central for 2.11.  Are these jars on Maven Central usable and
> the advice to build from source outdated?
>
> 2)  There's a note that the JDBC RDD isn't 2.11-compliant.  This is okay
> for
> us, but is there anything else to worry about?
>
> It would be nice to get some answers to those questions as well as any
> other
> feedback from maintainers or anyone that's used Spark with Scala 2.11
> beyond
> simple examples.
>
> Thanks,
> Sukant
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-11 Thread ayan guha
How about this?

sorted_counts = running_counts.map(lambda t: (t[1], t[0])).sortByKey()

Basically, swap the key and value of the RDD and then sort by the new key.
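
For the streaming example itself, here is a rough, untested sketch of the same
idea applied per batch with transform() (names follow
stateful_network_wordcount.py, where running_counts is the updateStateByKey
stream):

# Untested sketch: sort each batch's running counts by count, descending.
# Swap to (count, word), sort by the count key, then swap back to (word, count).
sorted_counts = running_counts.transform(
    lambda rdd: rdd.map(lambda wc: (wc[1], wc[0]))
                   .sortByKey(ascending=False)
                   .map(lambda cw: (cw[1], cw[0])))
sorted_counts.pprint()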

On Thu, Nov 12, 2015 at 8:53 AM, Amir Rahnama  wrote:

> Hey,
>
> Anybody knows how can one sort the result in the stateful example?
>
> Python would be prefered.
>
>
> https://github.com/apache/spark/blob/859dff56eb0f8c63c86e7e900a12340c199e6247/examples/src/main/python/streaming/stateful_network_wordcount.py
> --
> Thanks and Regards,
>
> Amir Hossein Rahnama
>
> *Tel: +46 (0) 761 681 102*
> Website: www.ambodi.com
> Twitter: @_ambodi 
>



-- 
Best Regards,
Ayan Guha