Re: Creating new Spark context when running in Secure YARN fails

2015-11-20 Thread Hari Shreedharan
Can you try this: https://github.com/apache/spark/pull/9875. I believe this patch should fix 
the issue here.

Thanks,
Hari Shreedharan




> On Nov 11, 2015, at 1:59 PM, Ted Yu  wrote:
> 
> Please take a look at 
> yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala
>  where this config is described
> 
> Cheers
> 
> On Wed, Nov 11, 2015 at 1:45 PM, Michael V Le <m...@us.ibm.com> wrote:
> It looks like my config does not have "spark.yarn.credentials.file".
> 
> I executed:
> sc._conf.getAll()
> 
> [(u'spark.ssl.keyStore', u'xxx.keystore'),
>  (u'spark.eventLog.enabled', u'true'),
>  (u'spark.ssl.keyStorePassword', u'XXX'),
>  (u'spark.yarn.principal', u'XXX'),
>  (u'spark.master', u'yarn-client'),
>  (u'spark.ssl.keyPassword', u'XXX'),
>  (u'spark.authenticate.sasl.serverAlwaysEncrypt', u'true'),
>  (u'spark.ssl.trustStorePassword', u'XXX'),
>  (u'spark.ssl.protocol', u'TLSv1.2'),
>  (u'spark.authenticate.enableSaslEncryption', u'true'),
>  (u'spark.app.name', u'PySparkShell'),
>  (u'spark.yarn.keytab', u'XXX.keytab'),
>  (u'spark.yarn.historyServer.address', u'xxx-001:18080'),
>  (u'spark.rdd.compress', u'True'),
>  (u'spark.eventLog.dir', u'hdfs://xxx-001:9000/user/hadoop/sparklogs'),
>  (u'spark.ssl.enabledAlgorithms', u'TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA'),
>  (u'spark.serializer.objectStreamReset', u'100'),
>  (u'spark.history.fs.logDirectory', u'hdfs://xxx-001:9000/user/hadoop/sparklogs'),
>  (u'spark.yarn.isPython', u'true'),
>  (u'spark.submit.deployMode', u'client'),
>  (u'spark.ssl.enabled', u'true'),
>  (u'spark.authenticate', u'true'),
>  (u'spark.ssl.trustStore', u'xxx.truststore')]
> 
> I am not really familiar with "spark.yarn.credentials.file" and had thought 
> it was created automatically after communicating with YARN to get tokens.
> 
> Thanks,
> Mike
> 
> 
> 
> From: Ted Yu <yuzhih...@gmail.com>
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user <user@spark.apache.org>
> Date: 11/11/2015 03:35 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> 
> 
> 
> 
> I assume your config contains "spark.yarn.credentials.file" - otherwise 
> startExecutorDelegationTokenRenewer(conf) call would be skipped.
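(For reference, a quick way to check that key from the Scala shell -- the sc._conf.getAll() dump above is the PySpark equivalent; this is just an illustration, not part of the fix:)

// getOption returns None when the key is absent; per the note above, the
// delegation-token renewer is only started when it is present.
sc.getConf.getOption("spark.yarn.credentials.file") match {
  case Some(path) => println(s"credentials file: $path")
  case None       => println("spark.yarn.credentials.file is not set")
}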
> 
> On Wed, Nov 11, 2015 at 12:16 PM, Michael V Le <m...@us.ibm.com> wrote:
> Hi Ted,
> 
> Thanks for reply.
> 
> I tried your patch but am having the same problem.
> 
> I ran:
> 
> ./bin/pyspark --master yarn-client
> 
> >> sc.stop()
> >> sc = SparkContext()
> 
> Same error dump as below.
> 
> Do I need to pass something to the new SparkContext?
> 
> Thanks,
> Mike
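(For illustration only -- the underlying renewal issue is what the patch at the top of this thread addresses -- here is a minimal sketch of stopping a context and creating a new one that reuses the original configuration. It is shown in Scala; the PySpark SparkContext constructor accepts a conf object the same way.)

import org.apache.spark.{SparkConf, SparkContext}

// Capture the existing configuration (keytab, principal, SSL settings, ...)
// before stopping the old context, then hand it to the new one.
val oldConf: SparkConf = sc.getConf
sc.stop()
val sc2 = new SparkContext(oldConf)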
> 
> 
> From: Ted Yu <yuzhih...@gmail.com>
> To: Michael V Le/Watson/IBM@IBMUS
> Cc: user <user@spark.apache.org>
> Date: 11/11/2015 01:55 PM
> Subject: Re: Creating new Spark context when running in Secure YARN fails
> 
> 
> 
> 
> Looks like the delegation token should be renewed.
> 
> Mind trying the following ?
> 
> Thanks
> 
> diff --git 
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
>  b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerB
> index 20771f6..e3c4a5a 100644
> --- 
> a/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> +++ 
> b/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala
> @@ -53,6 +53,12 @@ private[spark] class YarnClientSchedulerBackend(
>  logDebug("ClientArguments called with: " + argsArrayBuf.mkString(" "))
>  val args = new ClientArguments(argsArrayBuf.toArray, conf)
>  totalExpectedExecutors = args.numExecutors
> +// SPARK-8851: In yarn-client mode, the AM still does the credentials refresh. The driver
> +// re

Re: Monitoring tools for spark streaming

2015-09-28 Thread Hari Shreedharan
+1. The Streaming UI should give you more than enough information.




Thanks, Hari

On Mon, Sep 28, 2015 at 9:55 PM, Shixiong Zhu  wrote:

> Which version are you using? Could you take a look at the new Streaming UI
> in 1.4.0?
> Best Regards,
> Shixiong Zhu
> 2015-09-29 7:52 GMT+08:00 Siva :
>> Hi,
>>
>> Could someone recommend monitoring tools for Spark Streaming?
>>
>> By extending StreamingListener we can dump the delay in processing of
>> batches and some alert messages.
>>
>> But are there any Web UI tools where we can monitor failures, see delays
>> in processing and error messages, and set up alerts, etc.?
>>
>> Thanks
>>
>>
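(For reference, a minimal Scala sketch of the StreamingListener approach mentioned above. The class name and log format are just illustrative; the delays could equally be pushed to whatever alerting system you use.)

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Attach with ssc.addStreamingListener(new BatchDelayLogger) before ssc.start().
class BatchDelayLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"Batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"records=${info.numRecords}")
  }
}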

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Hari Shreedharan
It looks like you are having issues with the files getting distributed to
the cluster. What is the exception you are getting now?

On Wednesday, August 19, 2015, Ramkumar V  wrote:

> Thanks a lot for your suggestion. I modified HADOOP_CONF_DIR in
> spark-env.sh so that core-site.xml is under HADOOP_CONF_DIR. I can now see
> logs like the ones you showed above. Now I can run for 3 minutes and store
> results every minute, but after some time there is an exception. How can I
> fix this exception, and can you please explain where it is going wrong?
>
> Log link: http://pastebin.com/xL9jaRUa
>
>
> *Thanks*,
> 
>
>
> On Wed, Aug 19, 2015 at 1:54 PM, Jeff Zhang wrote:
>
>> HADOOP_CONF_DIR is the environment variable that points to the Hadoop conf
>> directory. Not sure how CDH organizes that; make sure core-site.xml is
>> under HADOOP_CONF_DIR.
>>
>> On Wed, Aug 19, 2015 at 4:06 PM, Ramkumar V wrote:
>>
>>> We are using Cloudera 5.3.1. Since it is one of the earlier versions of
>>> CDH, it does not support the latest version of Spark, so I installed
>>> Spark 1.4.1 separately on my machine. I am not able to do spark-submit in
>>> cluster mode. How do I put core-site.xml on the classpath? It would be very
>>> helpful if you could explain in detail how to solve this issue.
>>>
>>> *Thanks*,
>>> 
>>>
>>>
>>> On Fri, Aug 14, 2015 at 8:25 AM, Jeff Zhang wrote:
>>>

1. 15/08/12 13:24:49 INFO Client: Source and destination file
systems are the same. Not copying

 file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
2. 15/08/12 13:24:49 INFO Client: Source and destination file
systems are the same. Not copying

 file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
3. 15/08/12 13:24:49 INFO Client: Source and destination file
systems are the same. Not copying
file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
4. 15/08/12 13:24:49 INFO Client: Source and destination file
systems are the same. Not copying
file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
5. 15/08/12 13:24:49 INFO Client: Source and destination file
systems are the same. Not copying
file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
6.


1. diagnostics: Application application_1437639737006_3808 failed 2
times due to AM Container for appattempt_1437639737006_3808_02 
 exited
with  exitCode: -1000 due to: File
file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
2. .Failing this attempt.. Failing the application.



 The machine you run Spark on is the client machine, while the YARN AM is
 running on another machine, and the YARN AM complains that the files are
 not found, as your logs show.
 From the logs, it seems that these files are not copied to HDFS as
 local resources. I suspect you did not put core-site.xml on your
 classpath, so Spark cannot detect your remote file system and won't
 copy the files to HDFS as local resources. Usually in yarn-cluster mode,
 you should see logs like the following.

 > 15/08/14 10:48:49 INFO yarn.Client: Preparing resources for our AM
 container
 > 15/08/14 10:48:49 INFO yarn.Client: Uploading resource
 file:/Users/abc/github/spark/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar
 -> hdfs://
 0.0.0.0:9000/user/abc/.sparkStaging/application_1439432662178_0019/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar
 > 15/08/14 10:48:50 INFO yarn.Client: Uploading resource
 file:/Users/abc/github/spark/spark.py -> hdfs://
 0.0.0.0:9000/user/abc/.sparkStaging/application_1439432662178_0019/spark.py
 > 15/08/14 10:48:50 INFO yarn.Client: Uploading resource
 file:/Users/abc/github/spark/python/lib/pyspark.zip -> hdfs://
 0.0.0.0:9000/user/abc/.sparkStaging/application_1439432662178_0019/pyspark.zip

 On Thu, Aug 13, 2015 at 2:50 PM, Ramkumar V wrote:

> Hi,
>
> I have a cluster of 1 master and 2 slaves. I'm running a Spark
> Streaming job from the master and I want to utilize all nodes in my cluster. I
> specified some parameters like driver memory and executor memory in my
> code. When I pass --deploy-mode cluster --master yarn-cluster to
> spark-submit, it gives the following error.
>
> Log link: http://pastebin.com/kfyVWDGR
>
> How can I fix this issue? Please let me know if I'm doing something wrong.
>
>
> *Thanks*,
> Ramkumar V
>
>


 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best

Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
Btw, if you want to write to Spark Streaming from Flume -- there is a sink
(it is a part of Spark, not Flume). See Approach 2 here:
http://spark.apache.org/docs/latest/streaming-flume-integration.html
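(For reference, the receiving side of that approach looks roughly like the sketch below -- a minimal Scala example, where the host and port are placeholders for whatever the Spark Sink in your Flume agent is bound to.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// Pull-based receiver: Spark polls the SparkSink running inside the Flume agent.
val conf = new SparkConf().setAppName("FlumePollingSketch")
val ssc = new StreamingContext(conf, Seconds(10))
val events = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9999)
events.map(e => new String(e.event.getBody.array())).print()
ssc.start()
ssc.awaitTermination()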



On Wed, Nov 19, 2014 at 12:41 PM, Hari Shreedharan <
hshreedha...@cloudera.com> wrote:

> As of now, you can feed Spark Streaming from both kafka and flume.
> Currently though there is no API to write data back to either of the two
> directly.
>
> I sent a PR which should eventually add something like this:
> https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
> that would allow Spark Streaming to write back to Kafka. This will likely
> be reviewed and committed after 1.2.
>
> I would consider writing something similar to push data to Flume as well,
> if there is a sufficient use-case for it. I have seen people talk about
> writing back to kafka quite a bit - hence the above patch.
>
> Which one is better is up to your use case, existing infrastructure, and
> preference. Both would work as is, but writing back to Flume usually makes
> sense if you want to write to HDFS/HBase/Solr etc. -- which you could also
> write to directly from Spark Streaming (of course, there are benefits to
> writing back via Flume, like the additional buffering Flume gives), but it
> is still possible to do so from Spark Streaming itself.
>
> But for Kafka, the usual use-case is a variety of custom applications
> reading the same data -- for which it makes a whole lot of sense to write
> back to Kafka. An example is to sanitize incoming data in Spark Streaming
> (from Flume or Kafka or something else) and make it available for a variety
> of apps via Kafka.
>
> Hope this helps!
>
> Hari
>
>
> On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz 
> wrote:
>
>> Hi,
>>
>> I'm starting with Spark and I'm just trying to understand: if I want to
>> use Spark Streaming, should I feed it with Flume or Kafka? I think
>> there's no official sink from Flume to Spark Streaming, and it seems
>> that Kafka fits better since it gives you readability.
>>
>> Could someone give a good scenario for each alternative? When would it
>> make sense to use Kafka and when Flume for Spark Streaming?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark Streaming with Flume or Kafka?

2014-11-19 Thread Hari Shreedharan
As of now, you can feed Spark Streaming from both kafka and flume.
Currently though there is no API to write data back to either of the two
directly.

I sent a PR which should eventually add something like this:
https://github.com/harishreedharan/spark/blob/Kafka-output/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala
that would allow Spark Streaming to write back to Kafka. This will likely
be reviewed and committed after 1.2.

I would consider writing something similar to push data to Flume as well,
if there is a sufficient use-case for it. I have seen people talk about
writing back to kafka quite a bit - hence the above patch.

Which one is better is up to your use case, existing infrastructure, and
preference. Both would work as is, but writing back to Flume usually makes
sense if you want to write to HDFS/HBase/Solr etc. -- which you could also
write to directly from Spark Streaming (of course, there are benefits to
writing back via Flume, like the additional buffering Flume gives), but it
is still possible to do so from Spark Streaming itself.

But for Kafka, the usual use-case is a variety of custom applications
reading the same data -- for which it makes a whole lot of sense to write
back to Kafka. An example is to sanitize incoming data in Spark Streaming
(from Flume or Kafka or something else) and make it available for a variety
of apps via Kafka.
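(For anyone looking for a stopgap until such an API lands, here is a minimal hand-rolled Scala sketch of writing a DStream back to Kafka with the standard Kafka producer -- one producer per partition per batch. The broker list and the "clean-events" topic are placeholders; this is not the API proposed in the PR above.)

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(out: DStream[String], brokers: String): Unit = {
  out.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Create the producer inside foreachPartition so it is instantiated
      // on the executor rather than serialized from the driver.
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach(r => producer.send(new ProducerRecord[String, String]("clean-events", r)))
      producer.close()
    }
  }
}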

Hope this helps!

Hari


On Wed, Nov 19, 2014 at 8:10 AM, Guillermo Ortiz 
wrote:

> Hi,
>
> I'm starting with Spark and I'm just trying to understand: if I want to
> use Spark Streaming, should I feed it with Flume or Kafka? I think
> there's no official sink from Flume to Spark Streaming, and it seems
> that Kafka fits better since it gives you readability.
>
> Could someone give a good scenario for each alternative? When would it
> make sense to use Kafka and when Flume for Spark Streaming?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark and Scala

2014-09-12 Thread Hari Shreedharan
No, Scala primitives remain primitives. Unless you create an RDD using one
of the many methods for doing so, you will not be able to access any of the
RDD methods. There is no automatic porting. Spark is just an application as
far as Scala is concerned - there is no special compilation step (except, of
course, the usual Scala and JIT compilation).
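(To make the distinction concrete, a minimal sketch -- the values are arbitrary:)

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-primitive"))

val temp = 42            // a plain Scala Int; it never becomes an RDD
// temp.unpersist()      // would not compile: unpersist is not a member of Int

val cached = sc.parallelize(1 to 1000).cache()  // an RDD, created explicitly
cached.count()                                  // materializes the cached data
cached.unpersist()                              // valid: unpersist is defined on RDD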

On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan 
wrote:

> I know that unpersist is a method on RDD.
> But my confusion is that, when we port our Scala programs to Spark,
> doesn't everything change to RDDs?
>
> On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.
>>
>> An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
>> that doesn't change in Spark.
>>
>>> On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan wrote:
>>
>>> There is one thing that I am confused about.
>>> Spark has codes that have been implemented in Scala. Now, can we run any
>>> Scala code on the Spark framework? What will be the difference in the
>>> execution of the scala code in normal systems and on Spark?
>>> The reason for my question is the following:
>>> I had a variable
>>> *val temp = *
>>> This temp was being created inside a loop. To manually throw it out of
>>> the cache every time the loop ends, I was calling *temp.unpersist()*,
>>> but this was returning an error saying that *value unpersist is not a
>>> method of Int*, which means that temp is an Int.
>>> Can someone explain to me why I was not able to call *unpersist* on
>>> *temp*?
>>>
>>> Thank You
>>>
>>
>>
>


Re: Spark Streaming on Yarn Input from Flume

2014-08-07 Thread Hari Shreedharan
Do you see anything suspicious in the logs? How did you run the application?


On Thu, Aug 7, 2014 at 10:02 PM, XiaoQinyu 
wrote:

> Hi~
>
> I run a Spark Streaming app to receive data from Flume events. When I run it
> standalone, Spark Streaming can receive the Flume events normally. But if I
> run this app on YARN, whether in yarn-client or yarn-cluster mode, the app
> cannot receive data from Flume, and when I check the network stats I find that
> the port that should connect to Flume is not being listened on.
>
> I am not sure whether running Spark Streaming on YARN affects the connection
> to Flume.
>
> I then ran the FlumeEventCount example; when I run this example on
> YARN, it also cannot receive data from Flume.
>
> I would be very pleased if someone could help me.
>
>
> XiaoQinyu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-Yarn-Input-from-Flume-tp11755.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: store spark streaming dstream in hdfs or cassandra

2014-07-31 Thread Hari Shreedharan
Off the top of my head, you can use foreachRDD (which registers a 
ForEachDStream as an output stream), passing in the code that writes to 
Hadoop; the function you pass in is periodically executed and causes 
the data to be written to HDFS. If you are OK with the data being in 
text format, simply use the saveAsTextFiles method on the DStream (or 
saveAsTextFile on each RDD).
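(For example, two minimal sketches along those lines -- the HDFS paths are placeholders:)

import org.apache.spark.streaming.dstream.DStream

// Option 1: the built-in helper on DStream -- one directory of text files per batch interval.
def saveWithHelper(lines: DStream[String]): Unit =
  lines.saveAsTextFiles("hdfs://namenode:9000/user/hadoop/streaming/out")

// Option 2: register a custom output operation; any RDD save method can be used inside.
def saveWithForeachRDD(lines: DStream[String]): Unit =
  lines.foreachRDD { (rdd, time) =>
    rdd.saveAsTextFile(s"hdfs://namenode:9000/user/hadoop/streaming/batch-${time.milliseconds}")
  }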




salemi wrote:


Hi,

I was wondering what the best way is to store DStreams in HDFS or
Cassandra.
Could somebody provide an example?

Thanks,
Ali



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/store-spark-streaming-dstream-in-hdfs-or-cassandra-tp11064.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread Hari Shreedharan

Hi,

Deploying Spark with Flume is pretty simple. What you'd need to do is:

1. Start your Spark Flume DStream receiver on some machine using one of 
the FlumeUtils.createStream methods, where you specify the hostname and 
port of the worker node on which you want the Spark executor to run - 
say a.b.c.d:4585. This is where Spark will receive the data from Flume 
(see the sketch after these steps).

2. Once your application has started, start the Flume agent(s) that will 
be sending the data, with Avro sinks whose hostname is set to a.b.c.d 
and port set to 4585.

And you are done!
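(A minimal Scala sketch of step 1; a.b.c.d and 4585 stand in for the worker host and port that your Avro sinks will point at.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// Push-based receiver: listens on a.b.c.d:4585 for events pushed by the Flume Avro sinks.
val conf = new SparkConf().setAppName("FlumePushSketch")
val ssc = new StreamingContext(conf, Seconds(5))
val events = FlumeUtils.createStream(ssc, "a.b.c.d", 4585)
events.count().map(c => s"Received $c Flume events in this batch").print()
ssc.start()
ssc.awaitTermination()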

Tathagata Das wrote:


Hari, can you help?

TD

On Tue, Jul 29, 2014 at 12:13 PM, dapooley  wrote:


Hi,

I am trying to integrate Spark onto a Flume log sink and avro source. The
sink is on one machine (the application), and the source is on another. Log
events are being sent from the application server to the avro source server
(a log directory sink on the avro source prints to verify).

The aim is to get Spark to also receive the same events that the avro source
is getting. The steps, I believe, are:

1. install/start Spark master (on avro source machine).
2. write spark application, deploy (on avro source machine).
3. add spark application as a worker to the master.
4. have spark application configured to same port as avro source

Test setup is using 2 ubuntu VMs on a Windows host.

Flume configuration:

# application ##
## Tail application log file
# /var/lib/apache-flume-1.5.0-bin/bin/flume-ng agent -n cps -c conf -f
conf/flume-conf.properties
# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = tomcat
source_agent.sources.tomcat.type = exec
source_agent.sources.tomcat.command = tail -F
/var/lib/tomcat/logs/application.log
source_agent.sources.tomcat.batchSize = 1
source_agent.sources.tomcat.channels = memoryChannel

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Analytics Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 10.0.2.2
source_agent.sinks.avro_sink.port = 41414


 avro source ##
## Receive Flume events for Spark streaming

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
agent1.channels = memoryChannel
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100

## Flume Collector on Analytics Node
# http://flume.apache.org/FlumeUserGuide.html#avro-source
agent1.sources = avroSource
agent1.sources.avroSource.type = avro
agent1.sources.avroSource.channels = memoryChannel
agent1.sources.avroSource.bind = 0.0.0.0
agent1.sources.avroSource.port = 41414

#Sinks
agent1.sinks = localout

#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
agent1.sinks.localout.type = file_roll
agent1.sinks.localout.sink.directory = /home/vagrant/flume/logs
agent1.sinks.localout.sink.rollInterval = 0
agent1.sinks.localout.channel = memoryChannel

thank you in advance for any assistance,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Flume-integration-do-I-understand-this-correctly-tp10879.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.






--

Thanks,
Hari