Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Sethupathi T
Gabor, Thanks for the clarification. Thanks On Fri, Sep 6, 2019 at 12:38 AM Gabor Somogyi wrote: > Sethupathi, > > Let me extract then the important part what I've shared: > > 1. "This ensures that each Kafka source has its own consumer group that > does not face interference from any other

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Gabor Somogyi
Sethupathi, Let me extract then the important part what I've shared: 1. "This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer" 2. Consumers may eat the data from each other, offset calculation may give back wrong result (that's

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Sethupathi T
Gabor, Thanks for the quick response and sharing about spark 3.0, we need to use spark streaming (KafkaUtils.createDirectStream) rather than structured streaming, following this document https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html and re-iterating the issue again for

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Gabor Somogyi
Hi, Let me share the Spark 3.0 documentation part (Structured Streaming, not the DStreams you mentioned, but still relevant): kafka.group.id (string, default: none, applies to streaming and batch): The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query
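
A minimal sketch of what setting that option looks like in Spark 3.0+ Structured Streaming, assuming an existing SparkSession named spark; the broker address, topic, and group id are placeholders, and per the documentation quoted above the override should be used with caution.

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
    .option("subscribe", "mytopic")                       // placeholder topic
    .option("kafka.group.id", "my-fixed-group-id")        // overrides the auto-generated group id
    .load()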

Re: spark stream kafka wait for all data process done

2019-08-01 Thread 刘 勇
Hi, You can set spark.streaming.backpressure.enabled=true. If your tasks can't keep up with the incoming data, this setting throttles the rate at which Kafka data flows into the streaming job. You can also increase your streaming batch window. Sent from my Samsung Galaxy smartphone. Original
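
A minimal sketch of the relevant settings, assuming a DStream job; spark.streaming.kafka.maxRatePerPartition is the companion knob that hard-caps the per-partition read rate, and the values shown are placeholders.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kafka-backpressure-sketch")
    // let Spark adapt the ingestion rate to what recent batches could actually process
    .set("spark.streaming.backpressure.enabled", "true")
    // hard cap on records read per Kafka partition per second (placeholder value)
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")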

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
You're using an older version of spark, with what looks like a manually included different version of the kafka-clients jar (1.0) than what that version of the spark connector was written to depend on (0.10.0.1), so there's no telling what's going on. On Wed, Aug 29, 2018 at 3:40 PM, Guillermo

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think that it's a possible bug in this version? In Spark or Kafka? On Wed, Aug 29, 2018 at 10:28 PM, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrote: > > I'm using Spark

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines, anybody has the >

Re: Spark 2 Kafka Direct Stream Consumer Issue

2017-05-24 Thread Jayadeep J
Could any of the experts kindly advise ? On Fri, May 19, 2017 at 6:00 PM, Jayadeep J wrote: > Hi , > > I would appreciate some advice regarding an issue we are facing in > Streaming Kafka Direct Consumer. > > We have recently upgraded our application with Kafka Direct

Re: Spark streaming + kafka error with json library

2017-03-30 Thread Srikanth
Thanks for the tip. That worked. When would one use the assembly? On Wed, Mar 29, 2017 at 7:13 PM, Tathagata Das wrote: > Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) > > On Wed, Mar 29, 2017 at 9:59 AM, Srikanth

Re: Spark streaming + kafka error with json library

2017-03-29 Thread Tathagata Das
Try depending on "spark-streaming-kafka-0-10_2.11" (not the assembly) On Wed, Mar 29, 2017 at 9:59 AM, Srikanth wrote: > Hello, > > I'm trying to use "org.json4s" % "json4s-native" library in a spark > streaming + kafka direct app. > When I use the latest version of the
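
In sbt terms the advice above would look roughly like the sketch below; the Spark and json4s versions are placeholders, and %% picks up the Scala suffix (_2.11) automatically from scalaVersion.

  // build.sbt (sketch)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming"            % "2.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
    "org.json4s"       %% "json4s-native"              % "3.2.11"
  )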

Re: [Spark Streaming+Kafka][How-to]

2017-03-22 Thread Cody Koeninger
Glad you got it worked out. That's cool as long as your use case doesn't actually require e.g. partition 0 to always be scheduled to the same executor across different batches. On Tue, Mar 21, 2017 at 7:35 PM, OUASSAIDI, Sami wrote: > So it worked quite well with a

Re: [Spark Streaming+Kafka][How-to]

2017-03-21 Thread OUASSAIDI, Sami
So it worked quite well with a coalesce, I was able to find a solution to my problem: although not directly handling the executor, a good workaround was to assign the desired partition to a specific stream through the assign strategy and coalesce to a single partition, then repeat the same process for

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Another option that would avoid a shuffle would be to use assign and coalesce, running two separate streams.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("assign", """{t0: {"0": }, t1:{"0": x}}""")
  .load()
  .coalesce(1)
  .writeStream
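
A fuller sketch of that suggestion, assuming Spark 2.x Structured Streaming and the documented JSON format for the assign option; topic names, partition numbers, checkpoint paths, and the console sink are placeholders. Each query reads exactly one assigned topic-partition and coalesces to a single partition, so no shuffle is needed.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("assign-coalesce-sketch").getOrCreate()

  def singlePartitionStream(assignJson: String) =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder brokers
      .option("assign", assignJson)
      .load()
      .coalesce(1)

  // two separate queries, one per topic-partition
  val q0 = singlePartitionStream("""{"t0":[0]}""").writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/chk-t0")          // placeholder
    .start()

  val q1 = singlePartitionStream("""{"t1":[0]}""").writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/chk-t1")          // placeholder
    .start()

  spark.streams.awaitAnyTermination()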

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread OUASSAIDI, Sami
@Cody : Duly noted. @Michael Ambrust : A repartition is out of the question for our project as it would be a fairly expensive operation. We tried looking into targeting a specific executor so as to avoid this extra cost and directly have well partitioned data after consuming the kafka topics. Also

Re: [Spark Streaming+Kafka][How-to]

2017-03-17 Thread Michael Armbrust
Sorry, typo. Should be a repartition not a groupBy.
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "...")
>   .option("subscribe", "t0,t1")
>   .load()
>   .repartition($"partition")
>   .writeStream
>   .foreach(... code to write to cassandra ...)
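
A sketch of what the elided foreach sink might look like, assuming a custom ForeachWriter in Spark 2.x; the write logic is a placeholder for the poster's Cassandra code.

  import org.apache.spark.sql.{ForeachWriter, Row}

  val writer = new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true   // open a connection here
    def process(row: Row): Unit = {
      // write `row` to Cassandra (placeholder)
    }
    def close(errorOrNull: Throwable): Unit = ()                  // close the connection here
  }

  // used as: ...repartition($"partition").writeStream.foreach(writer).start()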

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Michael Armbrust
I think it should be straightforward to express this using structured streaming. You could ensure that data from a given partition ID is processed serially by performing a group by on the partition column. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...")

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Cody Koeninger
Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness. On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote: > > Hi all, > > So I need to specify how an executor

RE: Spark and Kafka integration

2017-01-12 Thread Phadnis, Varun
Cool! Thanks for your inputs Jacek and Mark! From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: 13 January 2017 12:59 To: Phadnis, Varun <phad...@sky.optymyze.com> Cc: user@spark.apache.org Subject: Re: Spark and Kafka integration See "API compatibility" in http://

Re: Spark and Kafka integration

2017-01-12 Thread Mark Hamstra
See "API compatibility" in http://spark.apache.org/versioning-policy.html While code that is annotated as Experimental is still a good faith effort to provide a stable and useful API, the fact is that we're not yet confident enough that we've got the public API in exactly the form that we want to

Re: Spark and Kafka integration

2017-01-12 Thread Jacek Laskowski
Hi Phadnis, I found this in http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html: > This version of the integration is marked as experimental, so the API is > potentially subject to change. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
Thanks, That is what I am missing. I have added cache before action, and that 2nd processing is avoided. 2016-09-10 5:10 GMT-07:00 Cody Koeninger : > Hard to say without seeing the code, but if you do multiple actions on an > Rdd without caching, the Rdd will be computed

Re: spark streaming kafka connector questions

2016-09-10 Thread Cody Koeninger
Hard to say without seeing the code, but if you do multiple actions on an Rdd without caching, the Rdd will be computed multiple times. On Sep 10, 2016 2:43 AM, "Cheng Yi" wrote: After some investigation, the problem i see is liked caused by a filter and union of the
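
A minimal illustration of the point above, assuming a DStream of (key, message) pairs named stream; the filter predicate and names are placeholders.

  import org.apache.spark.streaming.dstream.DStream

  def process(stream: DStream[(String, String)]): Unit = {
    stream.foreachRDD { rdd =>
      rdd.cache()  // without this, each action below recomputes the RDD (and re-reads from Kafka)

      val errors = rdd.filter { case (_, msg) => msg.contains("ERROR") }.count()
      val total  = rdd.count()
      println(s"$errors errors out of $total records")

      rdd.unpersist()
    }
  }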

Re: spark streaming kafka connector questions

2016-09-10 Thread Cheng Yi
After some investigation, the problem I see is likely caused by a filter and union of the dstream. If I just do kafka-stream -- process -- output operator, then there is no problem, one event will be fetched once. If I do kafka-stream -- process(1) -- filter a stream A for later union --|

Re: spark streaming kafka connector questions

2016-09-10 Thread 毅程
Cody, Thanks for the message. 1. as you mentioned, I do find the version for kafka 0.10.1, I will use that, although lots of experimental tags. Thank you. 2. I have done thorough investigating, it is NOT the scenario where 1st process failed, then 2nd process triggered. 3. I do agree the session

Re: spark streaming kafka connector questions

2016-09-08 Thread Cody Koeninger
- If you're seeing repeated attempts to process the same message, you should be able to look in the UI or logs and see that a task has failed. Figure out why that task failed before chasing other things - You're not using the latest version, the latest version is for spark 2.0. There are two

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Luciano Resende
bm.com/profiles/html/myProfileView.do> 8200 >> Warden Ave >> Markham, ON L6G 1C7 >> Canada >> >> >> >> - Original message - >> From: Cody Koeninger <c...@koeninger.org> >> To: Eric Ho <e...@analyticsmd.com> >> Cc: "u

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
; Warden Ave > Markham, ON L6G 1C7 > Canada > > > > - Original message - > From: Cody Koeninger <c...@koeninger.org> > To: Eric Ho <e...@analyticsmd.com> > Cc: "user@spark.apache.org" <user@spark.apache.org> > Subject: Re: Spark to

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Mihai Iacob
: Cody Koeninger <c...@koeninger.org>To: Eric Ho <e...@analyticsmd.com>Cc: "user@spark.apache.org" <user@spark.apache.org>Subject: Re: Spark to Kafka communication encrypted ?Date: Wed, Aug 31, 2016 10:34 AM  Encryption is only available in spark-streaming-kafka-0-10, not 0-

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
Encryption is only available in spark-streaming-kafka-0-10, not 0-8. You enable it the same way you enable it for the Kafka project's new consumer, by setting kafka configuration parameters appropriately. http://kafka.apache.org/documentation.html#security_ssl On Wed, Aug 31, 2016 at 2:03 AM,
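
A minimal sketch of SSL-enabled consumer parameters for the 0-10 integration, following the Kafka SSL documentation linked above; the broker address, group id, store paths, and passwords are placeholders.

  import org.apache.kafka.common.serialization.StringDeserializer

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9093",                      // placeholder SSL listener
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "my-consumer-group",
    // SSL settings, passed straight through to the underlying Kafka consumer
    "security.protocol" -> "SSL",
    "ssl.truststore.location" -> "/etc/kafka/client.truststore.jks",
    "ssl.truststore.password" -> "changeit",
    "ssl.keystore.location" -> "/etc/kafka/client.keystore.jks",
    "ssl.keystore.password" -> "changeit"
  )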

Re: Spark streaming Kafka Direct API + Multiple consumers

2016-07-07 Thread Rabin Banerjee
It's not required. *Simplified Parallelism:* No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Ah right i see. Thank you very much. On May 25, 2016 11:11 AM, "Cody Koeninger" wrote: > There's an overloaded createDirectStream method that takes a map from > topicpartition to offset for the starting point of the stream. > > On Wed, May 25, 2016 at 9:59 AM, trung kien

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
There's an overloaded createDirectStream method that takes a map from topicpartition to offset for the starting point of the stream. On Wed, May 25, 2016 at 9:59 AM, trung kien wrote: > Thank Cody. > > I can build the mapping from time ->offset. However how can i pass this >
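
A minimal sketch of that overload for the 0-8 direct API in use in this thread, assuming string keys and values and an existing StreamingContext named ssc; the broker, topic, partitions, and offsets are placeholders you would look up from your own time-to-offset index.

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  // start each topic-partition from an offset you computed yourself
  val fromOffsets = Map(
    TopicAndPartition("mytopic", 0) -> 12345L,
    TopicAndPartition("mytopic", 1) -> 67890L
  )

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  )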

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread trung kien
Thanks Cody. I can build the mapping from time -> offset. However, how can I pass this offset to the Spark Streaming job? (using the Direct Approach) On May 25, 2016 9:42 AM, "Cody Koeninger" wrote: > Kafka does not yet have meaningful time indexing, there's a kafka

Re: Spark Streaming - Kafka Direct Approach: re-compute from specific time

2016-05-25 Thread Cody Koeninger
Kafka does not yet have meaningful time indexing, there's a kafka improvement proposal for it but it has gotten pushed back to at least 0.10.1 If you want to do this kind of thing, you will need to maintain your own index from time to offset. On Wed, May 25, 2016 at 8:15 AM, trung kien

Re: Spark Streaming - Kafka - java.nio.BufferUnderflowException

2016-05-25 Thread Cody Koeninger
I'd fix the kafka version on the executor classpath (should be 0.8.2.1) before trying anything else, even if it may be unrelated to the actual error. Definitely don't upgrade your brokers to 0.9 On Wed, May 25, 2016 at 2:30 AM, Scott W wrote: > I'm running into below error

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Mich Talebzadeh
This works with Spark 1.6.1, using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77), Kafka version 0.9.0.1, using scala-library-2.11.7.jar Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka don't match. Please check them. On Wed, May 4, 2016 at 6:17 AM, أنس الليثي wrote: > NoSuchMethodError usually appears because of a difference in the library > versions. > > Check the version of the

Re: Spark and Kafka direct approach problem

2016-05-04 Thread أنس الليثي
NoSuchMethodError usually appears because of a difference in the library versions. Check the version of the libraries you downloaded, the version of spark, the version of Kafka. On 4 May 2016 at 16:18, Luca Ferrari wrote: > Hi, > > I’m new on Apache Spark and I’m trying

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Let me be more detailed in my response: Kafka works on “at least once” semantics. Therefore, given your assumption that Kafka "will be operational", we can assume that at least once semantics will hold. At this point, it comes down to designing for consumer (really Spark Executor) resilience.

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
At that scale, it’s best not to do coordination at the application layer. How much of your data is transactional in nature {all, some, none}? By which I mean ACID-compliant. > On Apr 19, 2016, at 10:53 AM, Erwan ALLAIN wrote: > > Cody, you're right that was an

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Cody, you're right, that was an example. Target architecture would be 3 DCs :) Good point on ZK, I'll have to check that. About Spark, both instances will run at the same time but on different topics. It would be quite useless to have 2 DCs working on the same set of data. I just want, in case

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
Maybe I'm missing something, but I don't see how you get a quorum in only 2 datacenters (without splitbrain problem, etc). I also don't know how well ZK will work cross-datacenter. As far as the spark side of things goes, if it's idempotent, why not just run both instances all the time. On

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
I'm describing a disaster recovery setup, but it can also be used to take one datacenter offline for an upgrade, for instance. From my point of view, when DC2 crashes: *On Kafka side:* - the kafka cluster will lose one or more brokers (partition leader and replica) - lost partition leaders will be re-elected in the

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Jason Nerothin
Is the main concern uptime or disaster recovery? > On Apr 19, 2016, at 9:12 AM, Cody Koeninger wrote: > > I think the bigger question is what happens to Kafka and your downstream data > store when DC2 crashes. > > From a Spark point of view, starting up a post-crash job in

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Cody Koeninger
I think the bigger question is what happens to Kafka and your downstream data store when DC2 crashes. From a Spark point of view, starting up a post-crash job in a new data center isn't really different from starting up a post-crash job in the original data center. On Tue, Apr 19, 2016 at 3:32

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-19 Thread Erwan ALLAIN
Thanks Jason and Cody. I'll try to explain the Multi DC case a bit better. As I mentioned before, I'm planning to use one kafka cluster and 2 or more distinct spark clusters. Let's say we have the following DC configuration in a nominal case. Kafka partitions are consumed uniformly by the 2

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Cody Koeninger
The current direct stream only handles exactly the partitions specified at startup. You'd have to restart the job if you changed partitions. https://issues.apache.org/jira/browse/SPARK-12177 has the ongoing work towards using the kafka 0.10 consumer, which would allow for dynamic topicpartitions

Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

2016-04-18 Thread Jason Nerothin
Hi Erwan, You might consider InsightEdge: http://insightedge.io . It has the capability of doing WAN between data grids and would save you the work of having to re-invent the wheel. Additionally, RDDs can be shared between developers in the same DC. Thanks, Jason >

Re: Spark Caching Kafka Metadata

2016-02-01 Thread Benjamin Han
Is there another way to create topics from Spark? Is there any reason the above code snippet would still produce this error? I've dumbly inserted waits and retries for testing, but that still doesn't consistently work, even after waiting several minutes. On Fri, Jan 29, 2016 at 8:29 AM, Cody

Re: Spark Caching Kafka Metadata

2016-01-29 Thread Cody Koeninger
The kafka direct stream doesn't do any explicit caching. I haven't looked through the underlying simple consumer code in the kafka project in detail, but I doubt it does either. Honestly, I'd recommend not using auto created topics (it makes it too easy to pollute your topics if someone

Re: Spark Streaming + Kafka + scala job message read issue

2016-01-15 Thread vivek.meghanathan
nathan (WT01 - NEP) <vivek.meghanat...@wipro.com>; duc.was.h...@gmail.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Bryan, Yes we are using only 1 thread per topic as we have only one Kafka server with 1 partition. What kind of logs will tell us

RE: Spark Streaming + Kafka + scala job message read issue

2016-01-05 Thread vivek.meghanathan
Meghanathan (WT01 - NEP) Sent: 27 December 2015 11:08 To: Bryan <bryan.jeff...@gmail.com> Cc: Vivek Meghanathan (WT01 - NEP) <vivek.meghanat...@wipro.com>; duc.was.h...@gmail.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Bryan, Yes

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread Bryan
...@gmail.com Cc: duc.was.h...@gmail.com; vivek.meghanat...@wipro.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Brian,PhuDuc, All 8 jobs are consuming 8 different IN topics. 8 different Scala jobs running each topic map mentioned below has only 1 thread

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-26 Thread vivek.meghanathan
l.com> Cc: duc.was.h...@gmail.com<mailto:duc.was.h...@gmail.com>; vivek.meghanat...@wipro.com<mailto:vivek.meghanat...@wipro.com>; user@spark.apache.org<mailto:user@spark.apache.org> Subject: Re: Spark Streaming + Kafka + scala job message read issue Hi Brian,PhuDuc, All 8

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
Any help is highly appreciated, i am completely stuck here.. From: Vivek Meghanathan (WT01 - NEP) Sent: Thursday, December 24, 2015 7:50 PM To: Bryan; user@spark.apache.org Subject: RE: Spark Streaming + Kafka + scala job message read issue We are using

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread vivek.meghanathan
t; for Windows 10 phone From: PhuDuc Nguyen<mailto:duc.was.h...@gmail.com> Sent: Friday, December 25, 2015 3:35 PM To: vivek.meghanat...@wipro.com<mailto:vivek.meghanat...@wipro.com> Cc: user@spark.apache.org<mailto:user@spark.apache.org> Subject: Re: Spark Streaming + Kafka +

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
...@wipro.com Sent: Friday, December 25, 2015 2:18 PM To: bryan.jeff...@gmail.com; user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue Any help is highly appreciated, i am completely stuck here.. From: Vivek Meghanathan (WT01 - NEP) Sent: Thursday, December 24

Re: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread PhuDuc Nguyen
gt; parse(line._2).extract[Search]) > > > > > > Regards, > Vivek M > > *From:* Bryan [mailto:bryan.jeff...@gmail.com] > *Sent:* 24 December 2015 17:20 > *To:* Vivek Meghanathan (WT01 - NEP) <vivek.meghanat...@wipro.com>; > user@spark.apache.org > *Subjec

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-25 Thread Bryan
Agreed. I did not see that they were using the same group name. Sent from Outlook Mail for Windows 10 phone From: PhuDuc Nguyen Sent: Friday, December 25, 2015 3:35 PM To: vivek.meghanat...@wipro.com Cc: user@spark.apache.org Subject: Re: Spark Streaming + Kafka + scala job message read issue

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread Bryan
Are you using a direct stream consumer, or the older receiver based consumer? If the latter, do the number of partitions you've specified for your topic match the number of partitions in the topic on Kafka? That would be a possible cause, as you might receive all data from a given partition

RE: Spark Streaming + Kafka + scala job message read issue

2015-12-24 Thread vivek.meghanathan
anathan (WT01 - NEP) <vivek.meghanat...@wipro.com>; user@spark.apache.org Subject: RE: Spark Streaming + Kafka + scala job message read issue Are you using a direct stream consumer, or the older receiver based consumer? If the latter, do the number of partitions you’ve specified fo

RE: Spark and Kafka Integration

2015-12-07 Thread Singh, Abhijeet
For Q2. The order of the logs in each partition is guaranteed but there cannot be any such thing as global order. From: Prashant Bhardwaj [mailto:prashant2006s...@gmail.com] Sent: Monday, December 07, 2015 5:46 PM To: user@spark.apache.org Subject: Spark and Kafka Integration Hi Some

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Adrian, I am doing collect for debugging purpose. But i have to use foreachRDD so that i can operate on top of this rdd and eventually save to DB. But my actual problem here is to properly convert Array[Byte] to my custom object. On Thu, Sep 17, 2015 at 7:04 PM, Adrian Tanase

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
I guess what I'm asking is why not start with a Byte array like in the example that works (using the DefaultDecoder) then map over it and do the decoding manually like I'm suggesting below. Have you tried this approach? We have the same workflow (kafka => protobuf => custom class) and it

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Why are you calling foreachRdd / collect in the first place? Instead of using a custom decoder, you should simply do – this is code executed on the workers and allows the computation to continue. ForeachRdd and collect are output operations and force the data to be collected on the driver

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Saisai Shao
Is your "KafkaGenericEvent" serializable? Since you call rdd.collect() to fetch the data to local driver, so this KafkaGenericEvent need to be serialized and deserialized through Java or Kryo (depends on your configuration) serializer, not sure if it is your problem to always get a default object.

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
@Saisai Shao, Thanks for the pointer. It turned out to be the serialization issue. I was using scalabuff to generate my "KafkaGenericEvent" class. But when i went through the generated class code, i figured out that it is not serializable. Now i am generating my classes using scalapb (

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Good catch! BTW, great choice with ScalaPB, we moved from scalabuff as well, in order to generate the classes at compile time from sbt. Sent from my iPhone On 17 Sep 2015, at 22:00, srungarapu vamsi > wrote: @Saisai Shao, Thanks for

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread srungarapu vamsi
If I understand correctly, I guess you are suggesting me to do this:
val kafkaDStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaConf, Set(topics))
kafkaDStream.map { case (devId, byteArray)
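
A sketch of how that map-based manual decode might look, assuming an existing StreamingContext ssc, a kafkaConf map, a topics string, and a protobuf-generated KafkaGenericEvent companion with a parseFrom method (all placeholders from the thread, not verified code).

  import kafka.serializer.{DefaultDecoder, StringDecoder}
  import org.apache.spark.streaming.kafka.KafkaUtils

  // receive raw bytes from Kafka and decode them on the workers
  val kafkaDStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
    ssc, kafkaConf, Set(topics))

  val events = kafkaDStream.map { case (devId, byteArray) =>
    // KafkaGenericEvent.parseFrom is the generated protobuf deserializer (placeholder name)
    (devId, KafkaGenericEvent.parseFrom(byteArray))
  }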

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Folks, Any help on this? Regards, Atul. On Fri, Sep 11, 2015 at 8:39 AM, Atul Kulkarni wrote: > Hi Raghavendra, > > Thanks for your answers, I am passing 10 executors and I am not sure if > that is the problem. It is still hung. > > Regards, > Atul. > > > On Fri, Sep

Re: Spark based Kafka Producer

2015-09-11 Thread Raghavendra Pandey
You can pass the number of executors via the command line option --num-executors. You need more than 2 executors to make spark streaming work. For more details on command line options, please go through http://spark.apache.org/docs/latest/running-on-yarn.html. On Fri, Sep 11, 2015 at 10:52 AM,

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Slight update: The following code with "spark context" works, with wild card file paths in hard coded strings but it won't work with a value parsed out of the program arguments as above: val sc = new SparkContext(sparkConf) val zipFileTextRDD =

Re: Spark based Kafka Producer

2015-09-11 Thread Atul Kulkarni
Hi Raghavendra, Thanks for your answers, I am passing 10 executors and I am not sure if that is the problem. It is still hung. Regards, Atul. On Fri, Sep 11, 2015 at 12:40 AM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > You can pass the number of executors via command line

Re: Spark based Kafka Producer

2015-09-10 Thread Atul Kulkarni
I am submitting the job with yarn-cluster mode. spark-submit --master yarn-cluster ... On Thu, Sep 10, 2015 at 7:50 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > What is the value of spark master conf.. By default it is local, that > means only one thread can run and that is

Re: Spark based Kafka Producer

2015-09-10 Thread Raghavendra Pandey
What is the value of the spark master conf? By default it is local, which means only one thread can run, and that is why your job is stuck. Specify local[*] to make the thread pool equal to the number of cores... Raghav On Sep 11, 2015 6:06 AM, "Atul Kulkarni" wrote: > Hi Folks,
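
A minimal sketch of setting this in code for a local test run, assuming a DStream job; the app name and batch interval are placeholders. On a cluster you would instead leave the master unset and pass --master and --num-executors to spark-submit, as suggested above.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // use all local cores so receivers and batch processing can run concurrently
  val conf = new SparkConf()
    .setAppName("kafka-producer-job")
    .setMaster("local[*]")

  val ssc = new StreamingContext(conf, Seconds(10))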

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Umesh Kacha
Hi Cody sorry my bad you were right there was a typo in topicSet. When I corrected typo in topicSet it started working. Thanks a lot. Regards On Thu, Jul 30, 2015 at 7:43 PM, Cody Koeninger c...@koeninger.org wrote: Can you post the code including the values of kafkaParams and topicSet,

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread Cody Koeninger
Can you post the code including the values of kafkaParams and topicSet, ideally the relevant output of kafka-topics.sh --describe as well On Wed, Jul 29, 2015 at 11:39 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi thanks for the response. Like I already mentioned in the question kafka topic

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-30 Thread gaurav sharma
I have run into similar exceptions ERROR DirectKafkaInputDStream: ArrayBuffer(java.net.SocketTimeoutException, org.apache.spark.SparkException: Couldn't find leader offsets for Set([AdServe,1])) and the issue has happened on the Kafka side, where my broker offsets go out of sync, or do not return

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue that Kafka cannot return a leader if there is no data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote: Hi I have Spark Streaming code which streams from Kafka

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Umesh Kacha
Hi thanks for the response. Like I already mentioned in the question kafka topic is valid and it has data I can see data in it using another kafka consumer. On Jul 30, 2015 7:31 AM, Cody Koeninger c...@koeninger.org wrote: The last time someone brought this up on the mailing list, the issue

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Cody Koeninger
The last time someone brought this up on the mailing list, the issue actually was that the topic(s) didn't exist in Kafka at the time the spark job was running. On Wed, Jul 29, 2015 at 6:17 PM, Tathagata Das t...@databricks.com wrote: There is a known issue that Kafka cannot return leader

Re: spark streaming kafka compatibility

2015-07-09 Thread Shushant Arora
Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with kafka cluster version 0.8.2 then spark streaming 1.3 should also work? I have tested standalone consumer kafka consumer 0.8.0 with kafka cluster 0.8.2 and that works. On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
Yes, it should work, let us know if not. On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks cody, so is it means if old kafka consumer 0.8.1.1 works with kafka cluster version 0.8.2 then spark streaming 1.3 should also work? I have tested standalone

Re: spark streaming kafka compatibility

2015-07-09 Thread Cody Koeninger
It's the consumer version. Should work with 0.8.2 clusters. On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora shushantaror...@gmail.com wrote: Does spark streaming 1.3 requires kafka version 0.8.1.1 and is not compatible with kafka 0.8.2 ? As per maven dependency of spark streaming 1.3 with

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody, That is clear. Thanks! Bill On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger c...@koeninger.org wrote: If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will potentially lose data. Is the link I posted, or for that matter the scaladoc, really not clear on that point? The

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ? 1. There's nothing preventing that. 2. Checkpointing will give you at-least-once semantics, provided you have sufficient kafka retention. Be aware that checkpoints aren't recoverable if you upgrade code.
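
A minimal sketch of driver-checkpoint recovery for a direct-stream job, assuming Spark 1.x DStreams; the checkpoint directory and batch interval are placeholders. On restart, offsets stored in the checkpoint are used instead of the latest available offsets, which is what gives at-least-once behavior.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///checkpoints/my-kafka-job"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("kafka-direct-checkpointed")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    // build the direct stream and the rest of the DAG here,
    // so it is restored along with the consumed offsets on restart
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()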

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
If a Spark streaming job stops at 12:01 and I resume the job at 12:02. Will it still start to consume the data that were produced to Kafka at 12:01? Or it will just start consuming from the current time? On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger c...@koeninger.org wrote: Have you read

Re: Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-02 Thread Akhil Das
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote: *Resending as I do not

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in spark-streaming-kafka package On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi I want to consume messages from kafka queue using spark batch program not spark streaming, Is there any way to achieve this, other than using low
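
A minimal sketch of batch consumption with KafkaUtils.createRDD from the spark-streaming-kafka (0.8) package, assuming you track offsets yourself; the brokers, topic, and offset bounds are placeholders.

  import kafka.serializer.StringDecoder
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

  // explicit offset ranges to read in this batch run; persist the upper bounds
  // somewhere yourself so the next run can continue from them
  val offsetRanges = Array(
    OffsetRange("mytopic", 0, 0L, 1000L)
  )

  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)

  rdd.map { case (key, message) => message }.take(10).foreach(println)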

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
Write Kafka stream to HDFS via Spark streaming then ingest files via Spark from HDFS. Sent with Good (www.good.com) -Original Message- From: Shushant Arora [shushantaror...@gmail.commailto:shushantaror...@gmail.com] Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time To:

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
KafkaRDD uses the simple consumer api. and i think you need to handle offsets yourself, unless things changed since i last looked. I would do second approach. On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks !! I have few more doubts : Does kafka RDD

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
I mean to say it is simpler in case of failures, restarts, upgrades, etc. Not just failures. But they did do a lot of work on streaming from kafka in spark 1.3.x to make it simpler (streaming simple calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe i am wrong and

Re: Spark and Kafka

2014-11-06 Thread Eduardo Costa Alfaia
This is my window:
reduceByKeyAndWindow(
  new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  },
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {

Re: spark-streaming-kafka with broadcast variable

2014-09-05 Thread Tathagata Das
I am not sure if there is a good, clean way to do that - broadcasts variables are not designed to be used out side spark job closures. You could try a bit of a hacky stuff where you write the serialized variable to file in HDFS / NFS / distributed files sytem, and then use a custom decoder class

Re: [Spark Streaming] kafka consumer announce

2014-08-29 Thread Evgeniy Shishkin
TD, can you please comment on this code? I am really interested in including this code in Spark. But i am bothering about some point about persistence: 1. When we extend Receiver and call store, is it blocking call? Does it return only when spark stores rdd as requested (i.e. replicated or

Re: [Spark Streaming] kafka consumer announce

2014-08-21 Thread Evgeniy Shishkin
On 21 Aug 2014, at 20:25, Tim Smith wrote: Thanks. Discovering kafka metadata from zookeeper instead of brokers is nicer. Saving metadata and offsets to HBase, is that optional or mandatory? Can it be made optional (default to zookeeper)? For now we implemented and somewhat hardcoded

RE: [spark-streaming] kafka source and flow control

2014-08-12 Thread Gwenhael Pasquiers
issue with file (hdfs) inputs ? how can I be sure the input won’t “overflow” the process chain ? From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: mardi 12 août 2014 02:58 To: Gwenhael Pasquiers Cc: u...@spark.incubator.apache.org Subject: Re: [spark-streaming] kafka source and flow control
