Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran
uts an empty file with the same filename as the original file, but the extension ".uploaded". This is an indicator that the file has been fully (!) written to the fs. Otherwise you risk only reading parts of the file. Then, you can have a file system listener for this .uploaded file. S
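
A minimal sketch of the writer side of this marker-file pattern, assuming plain HDFS client APIs (paths are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write and close the data file completely first, then create the
    // empty ".uploaded" marker so downstream readers know it is final.
    val fs = FileSystem.get(new Configuration())
    val dataFile = new Path("/landing/prices-0001.csv")
    // ... write dataFile via fs.create(dataFile) and close the stream ...
    fs.create(new Path(dataFile.toString + ".uploaded")).close()

On the reader side, StreamingContext.fileStream accepts a Path => Boolean filter, so a job can react only to files ending in ".uploaded".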

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Peyman Mohajerian
y file with the > same filename as the original file, but the extension ".uploaded". This is > an indicator that the file has been fully (!) written to the fs. Otherwise > you risk only reading parts of the file. > Then, you can have a file system listener for this .upload

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Jörn Franke
ave a file system listener for this .uploaded file. Spark streaming or Kafka are not needed/suitable if the server is a file server. You can use oozie (maybe with a simple custom action) to poll for .uploaded files and transmit them. > On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES) > w

Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Kappaganthu, Sivaram (ES)
) is it possible for spark-stream to trigger a job after a file is placed, instead of triggering a job at a fixed batch interval? 2) If it is not possible with Spark streaming, can we control this with Kafka/Flume? Thanks, Sivaram

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
KafkaRDD >> respectively if you want to see an example >> >> On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie wrote: >> > Ah, that makes it much clearer, thanks! >> > >> > It also brings up an additional question: who/what decides on the >> > partiti

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Daan Debie
within a time slot. > >> It's one RDD within a time slot, with multiple partitions. > >> > >> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie > wrote: > >> > Thanks, but that thread does not answer my questions, which are about > >>

Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
mechanisms and has been explicitly > designed for import/export and is tested. Not sure if i would go for spark > streaming if the use case is only storing, but I do not have the full > picture of your use case. > > Anyway, what you could do is create a directory / hour/ day etc (whatever >

The coming data on Spark Streaming

2016-09-14 Thread pcandido
Hi everyone, I'm getting started with Spark Streaming and would like to know some things about how data arrives. I know that SS uses micro-batches and they are received by workers and sent to an RDD. The master, at defined intervals, receives a pointer to the micro-batch RDD and can use it to process data

Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi, An alternative to Spark could be Flume to store data from Kafka to HDFS. It also provides some reliability mechanisms, has been explicitly designed for import/export, and is tested. Not sure if I would go for Spark streaming if the use case is only storing, but I do not have the full

Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
Hi, I have a Spark streaming that reads messages/prices from Kafka and writes it as text file to HDFS. This is pretty efficient. Its only function is to persist the incoming messages to HDFS. This is what it does dstream.foreachRDD { pricesRDD => val x= pricesRDD.co
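
A hedged sketch of the full persist-only pattern this snippet is driving at (the directory naming is illustrative, not the original code):

    // Persist each non-empty micro-batch of Kafka messages to HDFS as text.
    dstream.foreachRDD { pricesRDD =>
      if (!pricesRDD.isEmpty()) {
        pricesRDD.map(_._2)  // keep the message value, drop the Kafka key
          .saveAsTextFile(s"hdfs:///prices/batch-${System.currentTimeMillis()}")
      }
    }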

Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
t; It also brings up an additional question: who/what decides on the > > partitioning? Does Spark Streaming decide to divide a micro batch/RDD > into > > more than 1 partition based on size? Or is it something that the "source" > > (SocketStream, KafkaStream etc.) de

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream etc

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
example On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie wrote: > Ah, that makes it much clearer, thanks! > > It also brings up an additional question: who/what decides on the > partitioning? Does Spark Streaming decide to divide a micro batch/RDD into > more than 1 partition based on

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
bie wrote: > Thanks, but that thread does not answer my questions, which are about the > distributed nature of RDDs vs the small nature of "micro batches" and on how > Spark Streaming distributes work. > > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh > wrote: >>

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Thanks, but that thread does not answer my questions, which are about the distributed nature of RDDs vs the small nature of "micro batches" and on how Spark Streaming distributes work. On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh wrote: > Hi Daan, > > You may find this

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Mich Talebzadeh
Hi Daan, You may find this link Re: Is "spark streaming" streaming or mini-batch? <https://www.mail-archive.com/user@spark.apache.org/msg55914.html> helpful. This was a thread in this forum not long ago. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com

Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread DandyDev
Hi all! When reading about Spark Streaming and its execution model, I see diagrams like this a lot: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg> It does a fine job explaini
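
A quick way to observe the answer empirically, assuming any DStream named dstream:

    // Each micro-batch is one RDD; print how many partitions it carries.
    dstream.foreachRDD { rdd =>
      println(s"batch has ${rdd.partitions.length} partitions")
    }

For the Kafka direct stream the count will match the topic's partition count; for receiver-based sources it is roughly batch interval divided by spark.streaming.blockInterval.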

Re: Getting figures from spark streaming

2016-09-12 Thread Thunder Stumpges
, Ashok Kumar > wrote: > > > Hello Gurus, > > I am creating some figures and feed them into Kafka and then spark > streaming. > > It works OK but I have the following issue. > > For now as a test I sent 5 prices in each batch interval. In the loop code > this is what is h

Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
he-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-connector-questions-tp27681p27687.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: spark streaming kafka connector questions

2016-09-10 Thread Cheng Yi
hed without really doing the work. Help is hugely appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-connector-questions-tp27681p27687.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: spark streaming kafka connector questions

2016-09-10 Thread 毅程
sing other things > > - You're not using the latest version, the latest version is for spark > 2.0. There are two versions of the connector for spark 2.0, one for > kafka 0.8 or higher, and one for kafka 0.10 or higher > > - Committing individual messages to kafka doesn't make an

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
re are two versions of the connector for spark 2.0, one for kafka 0.8 or higher, and one for kafka 0.10 or higher - Committing individual messages to kafka doesn't make any sense, spark streaming deals with batches. If you're doing any aggregations that involve shuffling, there isn
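
The batch-level pattern Cody refers to, sketched against the spark-streaming-kafka 0.8 direct stream API (stream creation elided):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // The direct stream's RDDs carry their Kafka offset ranges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch, then persist offsetRanges with the results ...
      offsetRanges.foreach { o =>
        println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
      }
    }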

spark streaming kafka connector questions

2016-09-08 Thread Cheng Yi
I am using the latest streaming Kafka connector (org.apache.spark:spark-streaming-kafka_2.11:1.6.2). I am facing the problem that a message is delivered twice to my consumers. These two deliveries are 10+ seconds apart; it looks like this is caused by my lengthy message processing (took about 60

Posting selected rows of Spark streaming data to Hive table

2016-09-08 Thread Mich Talebzadeh
Hi, Within spark streaming I have identified the data that I want to persist to a Hive table. Table is already created. These are the values for columns extracted for(line <- pricesRDD.collect.toArray) { var index = line._2.split(',').view(0).toInt
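
A hedged sketch of one way to finish this: build a DataFrame from the parsed rows and append it to the existing table (column and table names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val rows = pricesRDD
      .map(_._2.split(','))
      .map(a => (a(0).toInt, a(1), a(2).toDouble))
      .toDF("index", "ticker", "price")

    // The table already exists, so append rather than overwrite.
    rows.write.mode("append").saveAsTable("marketdata.prices")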

Re: Creating HiveContext withing Spark streaming

2016-09-08 Thread Mich Talebzadeh
cutors", >>sparkWorkerUiRetainedExecutorsValue) >> sparkConf.set("sparkWorkerUiRetainedStages", >>sparkWorkerUiRetainedStagesValue) >> sparkConf.set("sparkUiRetainedJobs", >> sparkUiRetainedJob

Re: Creating HiveContext withing Spark streaming

2016-09-08 Thread Todd Nist
rk.streaming.stopGracefullyOnShutdown"," > true") > sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", > "true") > > sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", > "tr

Re: Creating HiveContext withing Spark streaming

2016-09-08 Thread Mich Talebzadeh
6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author

Creating HiveContext withing Spark streaming

2016-09-08 Thread Mich Talebzadeh
Hi, This may not be feasible in Spark streaming. I am trying to create a HiveContext in Spark streaming within the streaming context // Create a local StreamingContext with two working thread and batch interval of 2 seconds. val sparkConf = new SparkConf(). setAppName
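
The pattern the streaming programming guide suggests here, as a sketch: create the SQL context lazily from the batch RDD's SparkContext inside foreachRDD rather than up front (for a HiveContext, a hand-rolled singleton object works the same way):

    import org.apache.spark.sql.SQLContext

    dstream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // ... DataFrame / SQL work per batch ...
    }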

Re: Getting figures from spark streaming

2016-09-07 Thread Ashok Kumar
Any help on this is warmly appreciated. On Tuesday, 6 September 2016, 21:31, Ashok Kumar wrote: Hello Gurus, I am creating some figures and feeding them into Kafka and then Spark streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval

Re: Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Cody Koeninger
In general, see the material linked from https://github.com/koeninger/kafka-exactly-once if you want a better understanding of the direct stream. For spark-streaming-kafka-0-8, the direct stream doesn't really care about consumer group, since it uses the simple consumer. For the 0.10 ve

Q: Multiple spark streaming app, one kafka topic, same consumer group

2016-09-06 Thread Mariano Semelman
Hello everybody, I am trying to understand how Kafka Direct Stream works. I'm interested in having a production-ready Spark Streaming application that consumes a Kafka topic. But I need to guarantee there's (almost) no downtime, especially during deploys (and submit) of new versions. Wha

Getting figures from spark streaming

2016-09-06 Thread Ashok Kumar
Hello Gurus, I am creating some figures and feeding them into Kafka and then Spark streaming. It works OK but I have the following issue. For now, as a test, I send 5 prices in each batch interval. In the loop code this is what is happening: dstream.foreachRDD { rdd => val x = rdd.count

Reuters Market Data System connection to Spark Streaming

2016-09-04 Thread Mich Talebzadeh
Hi, Has anyone had experience of using a messaging system like Kafka to connect Reuters Market Data System to Spark Streaming, by any chance? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.

[HELP] Force stop a Spark Streaming application running on EMR

2016-09-01 Thread Rajkiran Rajkumar
Hi Spark community, I have a Spark streaming application that reads from a Kinesis stream and processes data. It calls some services which can experience transient failures. When such a transient failure happens, a retry mechanism kicks into action. For the shutdown use case, I have a separate

Re: Dynamic Allocation & Spark Streaming

2016-08-30 Thread Liren Ding
It has been a while since the last update on this thread. Now that Spark 2.0 is available, do you guys know if there's any progress on Dynamic Allocation & Spark Streaming? On Mon, Oct 19, 2015 at 1:13 PM, robert towne wrote: > I have watched a few videos from Databricks/Andrew Or aro

How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

2016-08-29 Thread kant kodali
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

Spark Streaming batch sequence number

2016-08-29 Thread Matt Smith
Is it possible to get a sequence number for the current batch (i.e. first batch is 0, second is 1, etc.)?
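
There is no built-in counter, but foreachRDD exposes the batch Time, from which a sequence number can be derived; a sketch assuming an existing DStream and a known batch interval:

    import org.apache.spark.streaming.Time

    val batchMs = 2000L            // must match the StreamingContext interval
    var first: Option[Long] = None // this closure runs on the driver
    dstream.foreachRDD { (rdd, time: Time) =>
      if (first.isEmpty) first = Some(time.milliseconds)
      val seq = (time.milliseconds - first.get) / batchMs
      println(s"processing batch #$seq")
    }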

can I use cassandra for checkpointing during a spark streaming job

2016-08-29 Thread kant kodali
I understand that I cannot use the Spark streaming window operation without checkpointing to HDFS, but without the window operation I don't think we can do much with Spark streaming. So, since it is essential, can I use Cassandra as the distributed storage? If so, can I see an example of how I can

Write parquet file from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone, Does anyone know how to write parquet file after parsing data in Spark Streaming? Thanks, Kevin.
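
A minimal sketch, assuming Spark 1.6 APIs, an existing DStream of CSV lines, and an illustrative record layout:

    import org.apache.spark.sql.{SQLContext, SaveMode}

    case class Event(id: String, value: Double)

    dstream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.map(_.split(','))
        .map(a => Event(a(0), a(1).toDouble))
        .toDF()
        .write.mode(SaveMode.Append).parquet("hdfs:///data/events")
    }

Note that appending every micro-batch produces many small parquet files; a periodic compaction job is a common companion to this pattern.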

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Sebastian Piu
. > > Cheers > > On Sat, 27 Aug 2016, 14:24 Renato Marroquín Mogrovejo, < > renatoj.marroq...@gmail.com> wrote: > >> Hi Akhilesh, >> >> Thanks for your response. >> I am using Spark 1.6.1 and what I am trying to do is to ingest parquet >> files in

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Sebastian Piu
k 1.6.1 and what I am trying to do is to ingest parquet > files into the Spark Streaming, not in batch operations. > > val ssc = new StreamingContext(sc, Seconds(5)) > ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", > "parquet.avr

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Renato Marroquín Mogrovejo
Hi Akhilesh, Thanks for your response. I am using Spark 1.6.1 and what I am trying to do is to ingest parquet files into the Spark Streaming, not in batch operations. val ssc = new StreamingContext(sc, Seconds(5)) ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Akhilesh Pathodia
Hi Renato, Which version of Spark are you using? If spark version is 1.3.0 or more then you can use SqlContext to read the parquet file which will give you DataFrame. Please follow the below link: https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically Thank

Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Akhilesh Pathodia
Hi Renato, Which version of Spark are you using? If spark version is 1.3.0 or more then you can use SqlContext to read the parquet file which will give you DataFrame. Please follow the below link: https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically Thank

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
pache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 26, 2016 at 9:42 PM, kant kodali < kanth...@gmail.com > wrote: is there a HTTP2 (v2) endpoint for Spark Streaming?

Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Anybody? I think Rory also didn't get an answer from the list ... https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com>: > Hi

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
te: is there a HTTP2 (v2) endpoint for Spark Streaming?

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread Jacek Laskowski
awiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 26, 2016 at 9:42 PM, kant kodali wrote: > is there a HTTP2 (v2) endpoint for Spark Streaming? >

is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
is there a HTTP2 (v2) endpoint for Spark Streaming?

Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi all, I am trying to use parquet files as input for DStream operations, but I can't find any documentation or example. The only thing I found was [1] but I also get the same error as in the post (Class parquet.avro.AvroReadSupport not found). Ideally I would like to have something like this:
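
A sketch of what that could look like with fileStream, untested here and assuming the parquet-avro read support referenced in [1] is on the classpath (the ClassNotFoundException suggests it is not):

    import org.apache.avro.generic.GenericRecord
    import parquet.hadoop.ParquetInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.sparkContext.hadoopConfiguration.set(
      "parquet.read.support.class", "parquet.avro.AvroReadSupport")
    val parquetStream = ssc.fileStream[Void, GenericRecord,
      ParquetInputFormat[GenericRecord]]("hdfs:///incoming/parquet")
    parquetStream.map(_._2).print()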

Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
> On Aug 25, 2016, at 6:33 AM, Dibyendu Bhattacharya < > dibyendu.bhattach...@gmail.com> wrote: > > Hi , > > Released latest version of Receiver based Kafka Consumer for Spark > Streaming. > > Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All >

Re: Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread mdkhajaasmath
t; Released latest version of Receiver based Kafka Consumer for Spark Streaming. > > Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All > Spark Versions > > Available at Spark Packages : > https://spark-packages.org/package/dibbhatt/kafka-spark-consumer >

Latest Release of Receiver based Kafka Consumer for Spark Streaming.

2016-08-25 Thread Dibyendu Bhattacharya
Hi , Released latest version of Receiver based Kafka Consumer for Spark Streaming. Receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and All Spark Versions Available at Spark Packages : https://spark-packages.org/package/dibbhatt/kafka-spark-consumer Also at github : https

Spark Streaming user function exceptions causing hangs

2016-08-24 Thread N B
Hello, We have a Spark streaming application (running Spark 1.6.1) that consumes data from a message queue. The application is running in local[*] mode so driver and executors are in a single JVM. The issue that we are dealing with these days is that if any of our lambda functions throw any

How to compute a net (difference) given a bi-directional stream of numbers using spark streaming?

2016-08-24 Thread kant kodali
Hi Guys, I am new to Spark but I am wondering how I can compute the net (difference) given a bi-directional stream of numbers using Spark streaming? To make it concrete, say Bank A is sending money to Bank B and Bank B is sending money to Bank A throughout the day, such that at any given time we want
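
One way to frame it, as a sketch: encode direction as a sign and keep a running net per bank pair with updateStateByKey (checkpointing required); transfers is an assumed DStream of transfer events:

    case class Transfer(from: String, to: String, amount: Double)

    val signed = transfers.map { t =>
      // Order the pair so A->B and B->A share a key; the sign encodes direction.
      val key  = if (t.from < t.to) (t.from, t.to) else (t.to, t.from)
      val sign = if (t.from < t.to) 1.0 else -1.0
      (key, sign * t.amount)
    }

    val net = signed.updateStateByKey[Double] { (amounts, state) =>
      Some(state.getOrElse(0.0) + amounts.sum)
    }
    net.print()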

Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Mich Talebzadeh
Is "spark streaming" streaming or mini-batch? I look at something like Complex Event Processing (CEP), which is a leading use case for data streaming (and I am experimenting with Spark for it), and in the realm of CEP there is really no such thing as continuous data streaming. The point is

Re: Spark Streaming application failing with Token issue

2016-08-24 Thread Jacek Laskowski
Hi Steve, Thanks a lot for such an elaborate email (though it brought more questions than answers, but that's because I'm new to YARN in general and Kerberos/tokens/tickets in particular). Thanks also for liking my notes. I'm very honoured to hear it from you. I value your work with Spark/YARN/Had

Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Steve Loughran
On 23 Aug 2016, at 17:58, Mich Talebzadeh mailto:mich.talebza...@gmail.com>> wrote: In general depending what you are doing you can tighten above parameters. For example if you are using Spark Streaming for Anti-fraud detection, you may stream data in at 2 seconds batch interval, Kee

Re: Spark Streaming application failing with Token issue

2016-08-24 Thread Steve Loughran
> On 23 Aug 2016, at 11:26, Jacek Laskowski wrote: > > Hi Steve, > > Could you share your opinion on whether the token gets renewed or not? > Is the token going to expire after 7 days anyway? There's Hadoop service tokens, and Kerberos tickets. They are similar-ish, but not quite the same.

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
e/blog/storm-in-a-teacup/> and it mentioned that spark > streaming actually mini-batch not actual streaming. > > I have not used streaming and I am not sure what is the difference in the 2 > terms. Hence could not make a judgement myself. >

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying. On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal wrote: > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ > and it mentioned that spark streaming actually mini-batch not actual > streaming. > > I have not used streaming an

How do we process/scale variable size batches in Apache Spark Streaming

2016-08-23 Thread Rachana Srivastava
I am running a spark streaming process where I am getting a batch of data after n seconds. I am using repartition to scale the application. Since the repartition size is fixed, we are getting lots of small files when the batch size is very small. Is there any way I can change the partitioner logic
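
One approach, sketched: size the partition count per batch inside transform (recordsPerPartition is an assumed tuning knob, and note that count() triggers an extra job per batch):

    val recordsPerPartition = 10000L
    val resized = dstream.transform { rdd =>
      val n = math.max(1L, rdd.count() / recordsPerPartition).toInt
      rdd.repartition(n)
    }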

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Mich Talebzadeh
nds over a previous 10 minutes period of time In general depending what you are doing you can tighten above parameters. For example if you are using Spark Streaming for Anti-fraud detection, you may stream data in at 2 seconds batch interval, Keep your windows length at 4 seconds and your sliding int
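
A sketch of the parameters Mich describes, with stream creation elided: 2-second batches, a 4-second window, and a 2-second slide:

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // each window covers the last 4 seconds and slides every 2 seconds
    val windowed = dstream.window(Seconds(4), Seconds(2))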

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Russell Spitzer
Spark streaming does not process 1 event at a time which is in general I think what people call "Streaming." It instead processes groups of events. Each group is a "MicroBatch" that gets processed at the same time. Streaming theoretically always has better latency because th

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread pandees waran
It's based on "micro batching" model. Sent from my iPhone > On Aug 23, 2016, at 8:41 AM, Aseem Bansal wrote: > > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and > it mentioned that spark streaming actually mini-batch not actual strea

Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and it mentioned that spark streaming is actually mini-batch, not actual streaming. I have not used streaming and I am not sure what the difference between the 2 terms is. Hence I could not make a judgement myself.

Re:Log rollover in spark streaming jobs

2016-08-23 Thread prosp4300
files, and logs can be accessed using YARN’s log utility. You can get more information here: https://spark.apache.org/docs/latest/running-on-yarn.html#configuration At 2016-08-23 18:44:29, "Pradeep" wrote: >Hi All, > >I am running Java spark streaming jobs in yarn-client m

Log rollover in spark streaming jobs

2016-08-23 Thread Pradeep
Hi All, I am running Java spark streaming jobs in yarn-client mode. Is there a way I can manage log rollover on the edge node? I have a 10-second batch and the log file volume is huge. Thanks, Pradeep

Re: Spark Streaming application failing with Token issue

2016-08-23 Thread Jacek Laskowski
Hi Steve, Could you share your opinion on whether the token gets renewed or not? Is the token going to expire after 7 days anyway? Why is the change in the recent version for token renewal? See https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa Pozdrawiam, Jacek Lasko

Re: Spark Streaming application failing with Token issue

2016-08-23 Thread Steve Loughran
On 21 Aug 2016, at 20:43, Mich Talebzadeh mailto:mich.talebza...@gmail.com>> wrote: Hi Kamesh, The message you are getting after 7 days: PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$

Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Mich Talebzadeh
> > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Thu, Aug 18, 2016 at 11:51 AM, Kamesh wrote: > > Hi all, > > > > I am running a spark streaming

Re: Spark Streaming application failing with Token issue

2016-08-21 Thread Jacek Laskowski
https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Aug 18, 2016 at 11:51 AM, Kamesh wrote: > Hi all, > > I am running a spark streaming application that store events into > Secur

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread chandan prakash
Ohh that explains the reason. My use case does not need state management. So i guess i am better off without checkpointing. Thank you for clarification. Regards, Chandan On Sat, Aug 20, 2016 at 8:24 AM, Cody Koeninger wrote: > Checkpointing is required to be turned on in certain situations (e.g

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread Cody Koeninger
Checkpointing is required to be turned on in certain situations (e.g. updateStateByKey), but you're certainly not required to rely on it for fault tolerance. I try not to. On Fri, Aug 19, 2016 at 1:51 PM, chandan prakash wrote: > Thanks Cody for the pointer. > > I am able to do this now. Not us

Re: Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Mich Talebzadeh
Thanks --jars /home/hduser/jars/spark-streaming-kafka-assembly_*2.11*-1.6.1.jar sorted it out Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Tathagata Das
You seem to be combining Scala 2.10 and 2.11 libraries - your sbt project is 2.11, whereas you are trying to pull in spark-streaming-kafka-assembly_*2.10*-1.6.1.jar. On Fri, Aug 19, 2016 at 11:24 AM, Mich Talebzadeh wrote: > Hi, > > My spark streaming app with 1.6.1 used to work. >

Spark streaming 2, giving error ClassNotFoundException: scala.collection.GenTraversableOnce$class

2016-08-19 Thread Mich Talebzadeh
Hi, My spark streaming app with 1.6.1 used to work. Now with scala> sc version res0: String = 2.0.0 Compiling with sbt assembly as before, with the following: version := "1.0", scalaVersion := "2.11.8", mainClass in Compile := Some(&
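
A build.sbt sketch matching the diagnosis above: let %% resolve the artifact for the declared scalaVersion instead of hard-coding a _2.10 assembly jar (versions are illustrative):

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"           % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
    )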

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-19 Thread chandan prakash
Thanks Cody for the pointer. I am able to do this now. Not using checkpointing. Rather storing offsets in zookeeper for fault tolerance. Spark Config changes now getting reflected in code deployment. *Using this api :* *KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (S
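
The API in question, sketched: resume the direct stream from offsets you stored yourself, with a hard-coded placeholder standing in for the ZooKeeper lookup (ssc and kafkaParams as defined elsewhere in the app):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets: Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("prices", 0) -> 12345L) // loaded from your store
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)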

Spark streaming

2016-08-18 Thread Diwakar Dhanuskodi
Hi, Is there a way to specify in createDirectStream to receive only the last 'n' offsets of a specific topic and partition? I don't want to filter them out in foreachRDD.
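
createDirectStream has no such option, but for a one-off read the batch API can fetch an explicit range; a sketch where the latest offset is assumed to have been looked up from Kafka beforehand (sc and kafkaParams defined elsewhere):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val n = 100L
    val latest = 5000L // assumed: fetched via Kafka's offset API beforehand
    val ranges = Array(OffsetRange("prices", 0, latest - n, latest))
    val lastN = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)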

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Yes, i looked into the source code implementation. sparkConf is serialized and saved during checkpointing and re-created from the checkpoint directory at time of restart. So any sparkConf parameter which you load from application.config and set in sparkConf object in code cannot be changed and re

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread Cody Koeninger
Checkpointing is not kafka-specific. It encompasses metadata about the application. You can't re-use a checkpoint if your application has changed. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash w

Spark Streaming application failing with Token issue

2016-08-18 Thread Kamesh
Hi all, I am running a spark streaming application that stores events into a Secure (Kerberized) HBase cluster. I launched this spark streaming application by passing --principal and --keytab. Despite this, the spark streaming application is failing after *7 days* with a token issue. Can someone please

Re: spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Is it possible that i use checkpoint directory to restart streaming but with modified parameter value in config file (e.g. username/password for db connection) ? Thanks in advance. Regards, Chandan On Thu, Aug 18, 2016 at 1:10 PM, chandan prakash wrote: > Hi, > I am using direct kafka with ch

spark streaming Directkafka with checkpointing : changed parameters not considered

2016-08-18 Thread chandan prakash
Hi, I am using direct kafka with checkpointing of offsets same as : https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala I need to change some parameters like db connection params : username/password for db connection . I stopped streaming grac

Spark Streaming fault tolerance benchmark

2016-08-13 Thread Dominik Safaric
A few months ago, as part of an empirical research project, I started investigating several stream processing engines, including but not limited to Spark Streaming. As the benchmark should extend beyond performance metrics such as throughput and latency, I've focused on fault

Re: restart spark streaming app

2016-08-12 Thread Jacek Laskowski
n Fri, Aug 12, 2016 at 4:38 PM, Shifeng Xiao wrote: > Hi folks, > > I am using Spark streaming, and I am not clear if there is smart way to > restart the app once it fails, currently we just have one cron job to check > if the job is running every 2 or 5 minutes and restart the

restart spark streaming app

2016-08-12 Thread Shifeng Xiao
Hi folks, I am using Spark streaming, and I am not clear if there is smart way to restart the app once it fails, currently we just have one cron job to check if the job is running every 2 or 5 minutes and restart the app when necessary. According to spark streaming guide: - *YARN* - Yarn
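
The restart pattern from the guide, sketched: recreate the StreamingContext from its checkpoint so a supervised or cron-driven restart resumes where it left off (paths illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setAppName("MyApp")

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint("hdfs:///checkpoints/myapp")
      // ... build the DStream graph here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/myapp",
      createContext _)
    ssc.start()
    ssc.awaitTermination()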

Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - predicts cluster assignment on data The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care ab
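
The two steps Bryan describes, sketched (k and the dimensions are illustrative; trainingStream is an assumed DStream of Vectors, testStream of LabeledPoints):

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(dim = 2, weight = 0.0)

    model.trainOn(trainingStream) // updates cluster centers on each batch
    model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()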

Re: Spark streaming not processing messages from partitioned topics

2016-08-11 Thread Diwakar Dhanuskodi
tering BlockManagerMaster >> > 16/08/10 18:16:27 INFO DiskBlockManager: Created local directory at >> > /tmp/blockmgr-0f110a7e-1edb-4140-9243-5579a7bc95ee >> > 16/08/10 18:16:27 INFO MemoryStore: MemoryStore started with capacity >> 511.5 >> > MB >>

Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Ahmed Sadek
Dear All, I was wondering why there is training data and testing data in kmeans? Shouldn't it be unsupervised learning with just access to stream data? I found a similar question but couldn't understand the answer. http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-pr

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
> > 16/08/10 18:16:27 INFO DiskBlockManager: Created local directory at > > /tmp/blockmgr-0f110a7e-1edb-4140-9243-5579a7bc95ee > > 16/08/10 18:16:27 INFO MemoryStore: MemoryStore started with capacity > 511.5 > > MB > > 16/08/10 18:16:27 INFO SparkEnv: Registering Outp

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
t; http://192.168.126.131:4040 > 16/08/10 18:16:27 INFO HttpFileServer: HTTP File server directory is > /tmp/spark-b60f692d-f5ea-44c1-aa21-ae132813828c/httpd-2b2a4e68-2952-41b0-a11b-f07860682749 > 16/08/10 18:16:27 INFO HttpServer: Starting HTTP Server > 16/08/10 18:16:27 INFO Utils:

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
ileServer: HTTP File server directory is /tmp/spark-b60f692d-f5ea-44c1-aa21-ae132813828c/httpd-2b2a4e68-2952-41b0-a11b-f07860682749 16/08/10 18:16:27 INFO HttpServer: Starting HTTP Server 16/08/10 18:16:27 INFO Utils: Successfully started service 'HTTP file server' on port 59491. 16/08/10 18

suggestion needed on FileInput Path- Spark Streaming

2016-08-10 Thread mdkhajaasmath
What is the best practice for processing files from an S3 bucket in Spark file streaming? I keep getting files in the S3 path and have to process them in batches, but while processing, other files might come up. In this streaming job, should I move files after the end of each streaming batch

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Cody Koeninger
7:52 INFO ShutdownHookManager: Deleting directory > /tmp/spark-6df1d6aa-896e-46e1-a2ed-199343dad0e2/httpd- > 07b9c1b6-01db-45b5-9302-d2f67f7c490e > 16/08/10 12:27:52 INFO RemoteActorRefProvider$RemotingTerminator: > Remoting shut down. > 16/08/10 12:27:52 INFO ShutdownHookManager: D

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Sivakumaran S
I am testing with one partition now. I am using Kafka 0.9 and Spark 1.6.1 (Scala 2.11). Just start with one topic first and then add more. I am not partitioning the topic. HTH, Regards, Sivakumaran > On 10-Aug-2016, at 5:56 AM, Diwakar Dhanuskodi > wrote: > > Hi Siva, > > Does topic has

Re: Spark streaming not processing messages from partitioned topics

2016-08-10 Thread Diwakar Dhanuskodi
12:27:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-6df1d6aa-896e-46e1-a2ed-199343dad0e2 [cloudera@quickstart ~]$ spark-submit --master local[3] --class com.boa.poc.todelete --jars /home/cloudera/lib/spark-streaming-kafka-assembly_2.10-1.6.2.jar,/home/cloudera/lib/spark-assembly-1.6.2-h

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
Hi Siva, Does the topic have partitions? Which version of Spark are you using? On Wed, Aug 10, 2016 at 2:38 AM, Sivakumaran S wrote: > Hi, > > Here is a working example I did. > > HTH > > Regards, > > Sivakumaran S > > val topics = "test" > val brokers = "localhost:9092" > val topicsSet = topics.sp

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Diwakar Dhanuskodi
It stops working at sqlContext.read.json(rdd.map(_._2)). Topics without partitions work fine. Do I need to set any other configs? val kafkaParams = Map[String,String]("bootstrap.servers"->"localhost:9093,localhost:9092", "group.id" -> "xyz","auto.offset.reset"->"smallest") Spark version is 1

Re: Spark streaming not processing messages from partitioned topics

2016-08-09 Thread Cody Koeninger
No, you don't need a conditional. read.json on an empty rdd will return an empty dataframe. Foreach on an empty dataframe or an empty rdd won't do anything (a task will still get run, but it won't do anything). Leave the conditional out. Add one thing at a time to the working rdd.foreach exampl
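
Cody's point in code, as a sketch against the 0.8 direct stream (stream creation elided):

    import org.apache.spark.sql.SQLContext

    stream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      val df = sqlContext.read.json(rdd.map(_._2)) // empty RDD => empty DataFrame
      df.foreach(println(_))                       // no-op when the batch is empty
    }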
