Spark streaming / Confluent Kafka - messages are empty

2022-06-09 Thread KhajaAsmath Mohammed
Hi, I am trying to read data from Confluent Kafka using the Avro schema registry. Messages are always empty and the stream always shows empty records. Any suggestions on this, please? Thanks, Asmath
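
A common cause worth checking: Confluent's Avro wire format prepends a magic byte and a 4-byte schema id to every value, so decoding the raw bytes directly yields nothing useful. A minimal sketch of one workaround, assuming Spark 3.x with the spark-avro package; the broker, topic, and schema below are placeholders, not the poster's actual setup:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.avro.functions.from_avro
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("confluent-avro").getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .option("startingOffsets", "earliest")  // empty output is often just a latest-offsets issue
      .load()

    // The schema would normally be fetched from the registry for subject "events-value".
    val jsonFormatSchema =
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""

    // Confluent-serialized values carry a 5-byte header (magic byte + schema id);
    // strip it before handing the payload to from_avro.
    val parsed = raw
      .select(expr("substring(value, 6, length(value) - 5)").as("payload"))
      .select(from_avro(col("payload"), jsonFormatSchema).as("event"))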

Re: Spark streaming with Kafka

2020-11-03 Thread Kevin Pis
Hi, this is my Word Count demo: https://github.com/kevincmchen/wordcount. MohitAbbi wrote on Wed, Nov 4, 2020 at 3:32 AM: > Hi, > Can you please share the correct versions of the JAR files you used to resolve the issue? I'm also facing the same issue. > Thanks

Re: Spark streaming with Kafka

2020-11-03 Thread MohitAbbi
Hi, Can you please share the correct versions of the JAR files you used to resolve the issue? I'm also facing the same issue. Thanks

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread Sean Owen
…tion.html > Regards > On Wed, 12 Aug 2020 at 14:12, Hamish Whittal wrote: > Hi folks, > Thought I would ask here because it's somewhat confusing. I'm using Spark 2.4.5 on EMR 5.30.1 with Amazon MSK. > The version of Scal…

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread German Schiavon
Thought I would ask here because it's somewhat confusing. I'm using Spark > 2.4.5 on EMR 5.30.1 with Amazon MSK. > > The version of Scala used is 2.11.12. I'm using this version of the > libraries spark-streaming-kafka-0-8_2.11-2.4.5.jar > > Now I'm wanting to read from Kafka

Spark Streaming with Kafka and Python

2020-08-12 Thread Hamish Whittal
Hi folks, Thought I would ask here because it's somewhat confusing. I'm using Spark 2.4.5 on EMR 5.30.1 with Amazon MSK. The version of Scala used is 2.11.12. I'm using this version of the libraries: spark-streaming-kafka-0-8_2.11-2.4.5.jar. Now I want to read from Kafka topics using Python
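
For this Spark/Kafka combination, the usual recommendation is Structured Streaming with the spark-sql-kafka-0-10 package rather than the 0-8 DStream library. A minimal Scala sketch under that assumption (broker and topic names are placeholders; the same options apply from PySpark):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    // Launched with a matching package, e.g.
    //   --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder
      .option("subscribe", "mytopic")                     // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    df.writeStream.format("console").start().awaitTermination()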

Re: Spark streaming with Kafka

2020-07-02 Thread dwgw
Hi, I was able to correct the issue. It was due to the wrong version of a JAR file: I removed those JARs, copied in the correct versions, and the error went away. Regards
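
For reference, the usual rule behind this class of error is that every Spark artifact on the classpath must agree on both the Spark version and the Scala binary version. A hypothetical sbt sketch:

    scalaVersion := "2.12.10"

    // Every Spark artifact must share the same Spark version (3.0.0 here);
    // %% appends the Scala binary version so the _2.12 suffixes stay consistent.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % "3.0.0" % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.0"
    )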

Re: Spark streaming with Kafka

2020-07-02 Thread Jungtaek Lim
I can't reproduce. Could you please make sure you're running spark-shell with official spark 3.0.0 distribution? Please try out changing the directory and using relative path like "./spark-shell". On Thu, Jul 2, 2020 at 9:59 PM dwgw wrote: > Hi > I am trying to stream kafka topic from spark

Spark streaming with Kafka

2020-07-02 Thread dwgw
Hi, I am trying to stream a Kafka topic from the Spark shell but I am getting the following error. I am using *spark 3.0.0/scala 2.12.10* (Java HotSpot(TM) 64-Bit Server VM, *Java 1.8.0_212*) *[spark@hdp-dev ~]$ spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0* Ivy Default Cache…

Spark Streaming loading kafka source value column type

2019-03-01 Thread oskarryn
Hi, Why is the `value` column in a streamed dataframe obtained from a Kafka topic natively of binary type (see the table at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) if in fact it holds a string with the message's data and we CAST it as string anyway?
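
Kafka itself is schema-agnostic and stores keys and values as opaque bytes, so Spark exposes them as binary and leaves the interpretation to the query; the CAST is only valid when the producer actually wrote UTF-8 strings. A minimal sketch (broker and topic are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
      .option("subscribe", "mytopic")                    // placeholder
      .load()

    // key/value arrive as binary because Kafka stores opaque bytes; the cast
    // is only meaningful when the producer wrote UTF-8 strings. Other payloads
    // (Avro, Protobuf, ...) are deserialized from the binary column instead.
    val strings = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")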

Spark streaming with kafka input stuck in (Re-)joining group because of group rebalancing

2018-05-15 Thread JF Chen
When I terminate a spark streaming application and restart it, it always gets stuck at this step: > Revoking previously assigned partitions [] for group [mygroup] > (Re-)joining group [mygroup]. If I use a new group id, even though it works fine, I may lose the data from the last time I read the…

Re: Checkpoints not cleaned using Spark streaming + watermarking + kafka

2017-09-22 Thread MathieuP
The expected setting to clean these files is spark.sql.streaming.minBatchesToRetain. More info on structured streaming settings: https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-sql-streaming-properties.adoc
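
A minimal sketch of setting it, assuming Spark 2.x or later; the value 10 is illustrative (the default is 100):

    import org.apache.spark.sql.SparkSession

    // Retain less checkpoint history so old state/offset files age out sooner.
    val spark = SparkSession.builder
      .config("spark.sql.streaming.minBatchesToRetain", 10)
      .getOrCreate()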

Checkpoints not cleaned using Spark streaming + watermarking + kafka

2017-09-21 Thread MathieuP
Hi Spark Users ! :) I come to you with a question about checkpoints. I have a streaming application that consumes and produces to Kafka. The computation requires a window and watermarking. Since this is a streaming application with a Kafka output, a checkpoint is expected. The application runs

Spark Streaming handling Kafka exceptions

2017-07-17 Thread Jean-Francois Gosselin
How can I handle an error with Kafka with my DirectStream (network issue, ZooKeeper or broker going down)? For example, when the consumer fails to connect to Kafka (at startup) I only get a DEBUG log (not even an ERROR) and no exceptions are thrown... I'm using Spark 2.1.1 and spark-streaming…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
…What is it you're actually trying to accomplish? > You can get topic, partition, and offset bounds from an offset range like > http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
…On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote: > What is it you're actually trying to accomplish? > You can get topic, partition, and offset bounds from an offset range like…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Cody Koeninger
…@koeninger.org> wrote: > What is it you're actually trying to accomplish? > You can get topic, partition, and offset bounds from an offset range like > http://spark.apache.org/docs/latest/streaming-ka…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-27 Thread Dominik Safaric
…streaming-kafka-0-10-integration.html#obtaining-offsets > Timestamp isn't really a meaningful idea for a range of offsets. > On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote: > Hi…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
…Timestamp isn't really a meaningful idea for a range of offsets. > On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote: > Hi all, > Because the Spark Streaming direct Kafka consume…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Dominik Safaric
…integration.html#obtaining-offsets > Timestamp isn't really a meaningful idea for a range of offsets. > On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote: > Hi all, > Because the Spark Streaming direct Kafka consumer maps…

Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-26 Thread Cody Koeninger
…On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric <dominiksafa...@gmail.com> wrote: > Hi all, > Because the Spark Streaming direct Kafka consumer maps offsets for a given Kafka topic and a partition internally while having enable.auto.commit set to false, how can I retrieve th…

Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll

2017-04-25 Thread Dominik Safaric
Hi all, Because the Spark Streaming direct Kafka consumer maps offsets for a given Kafka topic and partition internally while having enable.auto.commit set to false, how can I retrieve the offset of each consumer poll call it makes, using the offset ranges of an RDD? More precisely…
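
The documented pattern for this in the 0-10 integration is to cast each batch's RDD to HasOffsetRanges; roughly (`stream` is the DStream returned by createDirectStream):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Must be done on the RDD as returned by the stream, before any shuffle.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
      // With enable.auto.commit=false, offsets can be committed explicitly:
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }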

Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Ok, thanks for your answers. On 3/22/17, 1:34 PM, Cody Koeninger wrote: > If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once > If you're talking about producing…

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing the same message multiple times in a failure situation, keep an eye on

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
You have to handle de-duplication upstream or downstream. It might technically be possible to handle this in Spark, but you'll probably have a better time handling duplicates in the service that reads from Kafka. On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart wrote:…

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a spark streaming solution that subscribes to and pushes to Kafka. But we are running into the problem of duplicate events. Right now I am doing a "forEachRdd", looping over the messages of each partition and sending those messages to Kafka. Is there any good way of solving…
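
As the replies note, true exactly-once production to Kafka was not available at the time (idempotent/transactional producers arrived with Kafka 0.11), so the usual pattern is at-least-once plus downstream de-duplication. A hedged sketch of the foreachRDD/foreachPartition loop described above, assuming a DStream of (String, String) pairs and a placeholder output topic:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker:9092")  // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach { case (key, value) =>
          // Delivery is at-least-once: retries after a failure can resend
          // records, so consumers of "out-topic" should dedup on the key.
          producer.send(new ProducerRecord("out-topic", key, value))
        }
        producer.close()
      }
    }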

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-21 Thread shyla deshpande
Thanks TD. On Tue, Mar 14, 2017 at 4:37 PM, Tathagata Das wrote: > This setting allows multiple spark jobs generated through multiple > foreachRDD to run concurrently, even if they are across batches. So output > op2 from batch X, can run concurrently with op1 of batch X+1

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
…spark.streaming.receiver.maxRate > Thanks, > Edwin > On Mar 18, 2017, 12:53 AM -0400, sagarcasual . <sagarcas...@gmail.com>, wrote: > Hi, we have spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct approach. The streaming part works fine but when we initially start…

Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-18 Thread Mal Edwin
Hi, You can enable backpressure to handle this: spark.streaming.backpressure.enabled and spark.streaming.receiver.maxRate. Thanks, Edwin. On Mar 18, 2017, 12:53 AM -0400, sagarcasual . <sagarcas...@gmail.com>, wrote: > Hi, we have spark 1.6.1 streaming from a Kafka (0.10.1) topic using…

Spark Streaming from Kafka, deal with initial heavy load.

2017-03-17 Thread sagarcasual .
Hi, we have spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct approach. The streaming part works fine, but when we initially start the job we have to deal with a really huge Kafka message backlog, millions of messages, and that first batch runs for over 40 hours, and after 12 hours or so…
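
A sketch of the two knobs usually combined for this with the direct approach; note that spark.streaming.receiver.maxRate applies to receiver-based streams, while the per-partition cap below is what bounds the first batch here. Values are illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Direct-approach cap: max records per second per Kafka partition.
      // Backpressure only kicks in after the first batch, so this cap is
      // what keeps the initial backlog batch bounded.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")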

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread Tathagata Das
This setting allows multiple spark jobs generated through multiple foreachRDD to run concurrently, even if they are across batches. So output op2 from batch X, can run concurrently with op1 of batch X+1 This is not safe because it breaks the checkpointing logic in subtle ways. Note that this was

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread shyla deshpande
Thanks TD for the response. Can you please provide more explanation? I am having multiple streams in the spark streaming application (Spark 2.0.2 using DStreams). I know many people use this setting, so your explanation will help a lot of people. Thanks. On Fri, Mar 10, 2017 at 6:24 PM,…

Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread Tathagata Das
That config is not safe. Please do not use it. On Mar 10, 2017 10:03 AM, "shyla deshpande" wrote: > I have a spark streaming application which processes 3 kafka streams and > has 5 output operations. > Not sure what should be the setting for…

spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread shyla deshpande
I have a spark streaming application which processes 3 kafka streams and has 5 output operations. Not sure what should be the setting for spark.streaming.concurrentJobs. 1. If the concurrentJobs setting is 4 does that mean 2 output operations will be run sequentially? 2. If I had 6 cores what

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
If you haven't looked at the offset ranges in the logs for the time period in question, I'd start there. On Jan 24, 2017 2:51 PM, "Hakan İlter" wrote: Sorry for misunderstanding. When I said that, I meant there are no lag in consumer. Kafka Manager shows each consumer's

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Hakan İlter
…topics. After starting the job, everything works fine at first (like 700 req/sec), but after a while (a couple of days or a week) it starts processing only some part of the data (like 350…

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
…it starts processing only some part of the data (like 350 req/sec). When I check the kafka topics, I can see that there are still 700 req/sec coming to the to…

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Hakan İlter
…topics, I can see that there are still 700 req/sec coming to the topics. I don't see any errors, exceptions or any other problem. The job works fine when I start the same code with just a single kafka topic. > Do you…

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
…single kafka topic. > Do you have any idea or a clue to understand the problem? > Thanks. > -- > View this message in context: http://apache-spark-user-lis…

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread Hakan İlter
…Thanks. > -- > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-multiple-kafka-topic-doesn-t-work-at-least-once-tp28334.html > Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread Cody Koeninger
…View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-multiple-kafka-topic-doesn-t-work-at-least-once-tp28334.html > Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-23 Thread hakanilter
…View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-multiple-kafka-topic-doesn-t-work-at-least-once-tp28334.html. Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming with Kafka

2016-12-12 Thread Anton Okolnychyi
…7:11 GMT+01:00, Timur Shenkao <t...@timshenkao.su>: > Hi, > The usual general questions are: > -- what is your Spark version? > -- what is your Kafka version? > -- do you use "standard" Kafka con…

Re: Spark Streaming with Kafka

2016-12-12 Thread Cody Koeninger
…version? > -- do you use "standard" Kafka consumer or try to implement something custom (your own multi-threaded consumer)? > The freshest docs: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html…

Re: Spark Streaming with Kafka

2016-12-11 Thread Oleksii Dukhno
…The freshest docs: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > AFAIK, yes, you should use a unique group id for each stream (KAFKA 0.10!!!) > kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");…

Re: Spark Streaming with Kafka

2016-12-11 Thread Anton Okolnychyi
…On Sun, Dec 11, 2016 at 5:51 PM, Anton Okolnychyi <anton.okolnyc...@gmail.com> wrote: > Hi, > I am experimenting with Spark Streaming and Kafka. I will appreciate if someone can say whether the following assumption is correct.…

Re: Spark Streaming with Kafka

2016-12-11 Thread Timur Shenkao
…streaming-kafka-0-10-integration.html AFAIK, yes, you should use a unique group id for each stream (KAFKA 0.10!!!) > kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream"); > On Sun, Dec 11, 2016 at 5:51 PM, Anton Okolnychyi <anton.okolnyc...@gmail.com> wrote:…

Spark Streaming with Kafka

2016-12-11 Thread Anton Okolnychyi
Hi, I am experimenting with Spark Streaming and Kafka. I will appreciate if someone can say whether the following assumption is correct. If I have multiple computations (each with its own output) on one stream (created as KafkaUtils.createDirectStream), then there is a chance to have
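
A sketch of the consumer parameters this thread converges on for the 0.10 integration; the broker name is a placeholder and the UUID suffix is just one way to keep group ids unique per stream. If several output operations share one stream, caching the stream also avoids re-reading the batch from Kafka:

    import java.util.UUID

    // Each stream gets its own consumer group so the streams do not steal
    // partition assignments from one another.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",  // placeholder
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> s"stream-${UUID.randomUUID()}",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )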

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Karim, Md. Rezaul
Hi Tariq and Jon, First of all, thanks for the quick response. I really appreciate it. Well, I would like to start from the very beginning of using Kafka with Spark. For example, in the Spark distribution I found an example using Kafka with Spark streaming that demonstrates a Direct Kafka Word Count…
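
The example referred to is DirectKafkaWordCount in the Spark distribution. A condensed sketch of the same idea against the 0-10 API (broker, topic, and group id are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "wordcount"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("mytopic"), kafkaParams))

    stream.map(_.value)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()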

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Jon Gregg
…On Nov 16, 2016, Karim, Md. Rezaul <rezaul.ka...@insight-centre.org> wrote: > Hi All, > I am completely new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using Spark Stre…

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Mohammad Tariq
…new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using Spark Streaming API with Kafka. > I am aware of the Spark Streaming and Kafka integration [1]. However, a real-life example would be better to start with.…

Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Karim, Md. Rezaul
Hi All, I am completely new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using the Spark Streaming API with Kafka. I am aware of the Spark Streaming and Kafka integration [1]. However, a real-life example would be better…

Re: Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Sean Owen
…/docs/latest/streaming-kafka-0-10-integration.html > I've added these dependencies: > org.apache.spark : spark-streaming-kafka-0-10_2.11 : 2.0.1 > org.apache.spark : spark-core…

Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Furkan KAMACI
…to it. I wanted to follow that example: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html I've added these dependencies: org.apache.spark : spark-streaming-kafka-0-10_2.11 : 2.0.1 and org.apache.spark : spark-core…
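
An sbt equivalent of the dependencies listed above, assuming the core artifacts are provided by the cluster:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
    )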

Re: Spark Streaming with Kafka

2016-05-24 Thread Rasika Pohankar
…/. Regards, Rasika. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222p27014.html

Re: Spark streaming from Kafka best fit

2016-03-07 Thread pratik khadloya
…additional processing. The question of which executor works on which tasks is up to the scheduler (and getPreferredLocations, which only matters if you're running spark on the same nodes as kafka). On Tue, Mar 1, 2016 at 2…

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
…other executors will do additional processing. The question of which executor works on which tasks is up to the scheduler (and getPreferredLocations, which only matters if you're running spark on the same nodes as kafka). On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar…

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Jatin Kumar
…running spark on the same nodes as kafka). On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <jku...@rocketfuelinc.com.invalid> wrote: > Hello all, > I see that there are as of today 3 ways one can read from Kafka in spark streami…

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
…which tasks is up to the scheduler (and getPreferredLocations, which only matters if you're running spark on the same nodes as kafka). On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <jku...@rocketfuelinc.com.invalid> wrote: > Hello all, > I see that there are as of today 3 ways one…

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
If by smaller block interval you mean the value in seconds passed to the streaming context constructor, no. You'll still get everything from the starting offset until now in the first batch. On Thu, Feb 18, 2016 at 10:02 AM, praveen S wrote: > Sorry.. Rephrasing : > Can

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Sorry.. Rephrasing : Can this issue be resolved by having a smaller block interval? Regards, Praveen On 18 Feb 2016 21:30, "praveen S" wrote: > Can having a smaller block interval only resolve this? > > Regards, > Praveen > On 18 Feb 2016 21:13, "Cody Koeninger"

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Can having a smaller block interval only resolve this? Regards, Praveen On 18 Feb 2016 21:13, "Cody Koeninger" wrote: > Backpressure won't help you with the first batch, you'd need > spark.streaming.kafka.maxRatePerPartition > for that > > On Thu, Feb 18, 2016 at 9:40 AM,

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread Cody Koeninger
Backpressure won't help you with the first batch, you'd need spark.streaming.kafka.maxRatePerPartition for that On Thu, Feb 18, 2016 at 9:40 AM, praveen S wrote: > Have a look at > > spark.streaming.backpressure.enabled > Property > > Regards, > Praveen > On 18 Feb 2016

Re: Spark Streaming with Kafka Use Case

2016-02-18 Thread praveen S
Have a look at spark.streaming.backpressure.enabled Property Regards, Praveen On 18 Feb 2016 00:13, "Abhishek Anand" wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my application has

Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cody Koeninger
You can print whatever you want wherever you want; it's just a question of whether it's going to show up in the driver's or the various executors' logs. On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon wrote: > I don't think we can print an integer value in a spark streaming…

Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cyril Scetbon
I don't think we can print an integer value in a spark streaming process, as opposed to a spark job. I think I can print the content of an RDD but not debug messages. Am I wrong? Cyril Scetbon > On Feb 17, 2016, at 12:51 AM, ayan guha wrote: > Hi > You can always…

Re: Spark Streaming with Kafka Use Case

2016-02-17 Thread Cody Koeninger
Just use a kafka rdd in a batch job or two, then start your streaming job. On Wed, Feb 17, 2016 at 12:57 AM, Abhishek Anand wrote: > I have a spark streaming application running in production. I am trying to > find a solution for a particular use case when my
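
A hedged sketch of that suggestion using the 0-8 batch API, assuming an existing SparkContext `sc`; the topic, broker, and offset bounds are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Drain a fixed slice of the backlog as a plain batch job, then start
    // the streaming job from where the range ends.
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, fromOffset = 0L, untilOffset = 1000000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, Map("metadata.broker.list" -> "broker:9092"), offsetRanges)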

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
Hi You can always use RDD properties, which already has partition information. https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon wrote:

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Your understanding is the right one (having re-read the documentation). Still wondering how I can verify that 5 partitions have been created. My job is reading from a topic in Kafka that has 5 partitions and sends the data to E/S. I can see that when there is one task to read from Kafka there

Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread ayan guha
I have a slightly different understanding. Direct stream generates 1 RDD per batch, however, number of partitions in that RDD = number of partitions in kafka topic. On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon wrote: > Hi guys, > > I'm making some tests with Spark and

Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Hi guys, I'm making some tests with Spark and Kafka using a Python script. I use the second method, which doesn't need any receiver (the Direct Approach). It should adapt the number of RDD partitions to the number of partitions in the topic. I'm trying to verify that. What's the easiest way to verify it? I…
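
A quick way to check, sketched in Scala (the same idea works from Python via rdd.getNumPartitions()); `stream` is the direct stream created above:

    stream.foreachRDD { rdd =>
      // The direct stream produces one RDD per batch whose partition count
      // mirrors the topic's Kafka partition count (5 here, if the topic has 5).
      println(s"partitions in this batch: ${rdd.partitions.size}")
    }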

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-12 Thread p pathiyil
Thanks Sebastian. I was indeed trying out FAIR scheduling with a high value for concurrentJobs today. It does improve the latency seen by the non-hot partitions, even if it does not provide complete isolation. So it might be an acceptable middle ground. On 12 Feb 2016 12:18, "Sebastian Piu"
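
A sketch of the combination discussed in this thread; values are illustrative, and note the caveat elsewhere in this list that spark.streaming.concurrentJobs is undocumented and can break checkpointing:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.streaming.concurrentJobs", "4")  // undocumented; see caveat above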

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Sebastian Piu
Have you tried using the fair scheduler and queues? On 12 Feb 2016 4:24 a.m., "p pathiyil" wrote: > With this setting, I can see that the next job is being executed before > the previous one is finished. However, the processing of the 'hot' > partition eventually hogs all the…

Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
Hi, I am looking at a way to isolate the processing of messages from each Kafka partition within the same driver. Scenario: A DStream is created with the createDirectStream call by passing in a few partitions. Let us say that the streaming context is defined to have a time duration of 2 seconds.

Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
With this setting, I can see that the next job is being executed before the previous one is finished. However, the processing of the 'hot' partition eventually hogs all the concurrent jobs. If there was a way to restrict jobs to be one per partition, then this setting would provide the

[Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
All, I'm new to Spark and I'm having a hard time doing a simple join of two DFs. Intent: I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a…
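
A hedged sketch of one way to do this with the spark-cassandra-connector DataFrame source; the keyspace, table, join column, and the `sqlContext`/`kafkaDf` values are assumptions for illustration, not taken from the thread:

    // Assumes the spark-cassandra-connector is on the classpath and an
    // existing `sqlContext`; keyspace/table/column names are hypothetical.
    val cassandraDf = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "lookup", "table" -> "users"))
      .load()

    // `kafkaDf` is the DataFrame decoded from the Kafka Protobuf messages.
    val enriched = kafkaDf.join(cassandraDf, Seq("user_id"))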

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
…user@spark.apache.org Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames > All, I'm new to Spark and I'm having a hard time doing a simple join of two DFs. Intent: I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka me…

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] Sent: Tuesday, February 9, 2016 10:05 PM To: Mohammed Guller Cc: user@spark.apache.org Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames > Hi Mohammed, Thanks for the hint, I should probably do that :) As for the DF…

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
…Mohammed, Author: Big Data Analytics with Spark -Original Message- From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] Sent: Tuesday, February 9, 2016 10:47 PM To: Mohammed Guller Cc: user@spark.apache.org Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames > Hi Mohammed…

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
…From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] Sent: Tuesday, February 9, 2016 10:05 PM To: Mohammed Guller Cc: user@spark.apache.org Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames > Hi Mohammed, Thanks for the hint, I should probably do that :) As for the DF…

RE: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread Mohammed Guller
…with Spark -Original Message- From: bernh...@chapter7.ch [mailto:bernh...@chapter7.ch] Sent: Tuesday, February 9, 2016 10:47 PM To: Mohammed Guller Cc: user@spark.apache.org Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames > Hi Mohammed, I'm aware of that documentation, what…

Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

2016-02-09 Thread bernhard
…bernh...@chapter7.ch] Sent: Tuesday, February 9, 2016 6:58 AM To: user@spark.apache.org Subject: [Spark Streaming] Joining Kafka and Cassandra DataFrames > All, I'm new to Spark and I'm having a hard time doing a simple join of two DFs. Intent: I'm receiving data from Kafka via direct stream and wou…

Spark Streaming: My kafka receivers are not consuming in parallel

2016-02-03 Thread Jorge Rodriguez
Hello Spark users, We are setting up our first batch of spark streaming pipelines, and I am running into an issue which I am not sure how to resolve, but which seems like it should be fairly trivial. I am using the receiver-mode Kafka consumer that comes with Spark, running in standalone mode. I've set up…

Re: Spark Streaming: My kafka receivers are not consuming in parallel

2016-02-03 Thread Jorge Rodriguez
Please ignore this question, as I've figured out what my problem was. In case anyone else runs into something similar: the problem was on the kafka side. I was using the console producer to generate the messages going into the kafka logs. This producer will send all of the messages to…

Re: Spark Streaming with Kafka - batch DStreams in memory

2016-02-02 Thread Cody Koeninger
It's possible you could (ab)use updateStateByKey or mapWithState for this. But honestly it's probably a lot more straightforward to just choose a reasonable batch size that gets you a reasonable file size for most of your keys, then use filecrush or something similar to deal with the hdfs small

Spark Streaming with Kafka - batch DStreams in memory

2016-02-01 Thread p pathiyil
Hi, Are there any ways to store DStreams / RDDs read from Kafka in memory, to be processed at a later time? What we need to do is to read data from Kafka, process it to be keyed by some attribute that is present in the Kafka messages, and write out the data related to each key when we have…
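
Besides the mapWithState suggestion above, one way to hold several batches of keyed data in memory and emit them together is a windowed DStream. A sketch assuming a `stream` DStream and a hypothetical extractKey function:

    import org.apache.spark.streaming.Minutes

    // Hold 30 minutes of messages in memory and emit them once per window,
    // grouped by a key extracted from each message (extractKey is hypothetical).
    stream.map(record => (extractKey(record), record))
      .groupByKeyAndWindow(Minutes(30), Minutes(30))
      .foreachRDD { rdd =>
        // write out one file per key here
      }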

Re: Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-24 Thread Akhil Das
Would you mind posting the relevant code snippet? Thanks Best Regards On Wed, Dec 23, 2015 at 7:33 PM, Vyacheslav Yanuk wrote: > Hi. > I have very strange situation with direct reading from Kafka. > For example. > I have 1000 messages in Kafka. > After submitting my

Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-23 Thread Vyacheslav Yanuk
Hi. I have a very strange situation with direct reading from Kafka. For example: I have 1000 messages in Kafka. After submitting my application I read this data and process it. While I process the data, 10 new entries accumulate. In the next read from Kafka I read only 3 records, but not 10!!!

Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Vyacheslav Yanuk
Colleagues, the documentation for createDirectStream says: "This does not use Zookeeper to store offsets. The consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the…

Re: Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Cody Koeninger
Read the documentation spark.apache.org/docs/latest/streaming-kafka-integration.html If you still have questions, read the resources linked from https://github.com/koeninger/kafka-exactly-once On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk wrote: > Colleagues >

Re: Spark Streaming Specify Kafka Partition

2015-12-04 Thread Cody Koeninger
…might be a question more fit for a Scala mailing list, but Google is failing me at the moment for hints on the interoperability of Scala and Java generics. > [1] The original createDirectStream: https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/…

Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread SRK
Hi, Our processing times in Spark Streaming with the kafka Direct approach seem to have increased considerably with the increase in Site traffic. Would increasing the number of kafka partitions decrease the processing times? Any suggestions on tuning to reduce the processing times would…

Re: Higher Processing times in Spark Streaming with kafka Direct

2015-12-04 Thread u...@moosheimer.com
On 04.12.2015 at 22:21, SRK <swethakasire...@gmail.com> wrote: > Hi, > Our processing times in Spark Streaming with the kafka Direct approach seem to have increased considerably with the increase in Site traffic. Would > increasing the number of kafka partitions decrease…

Re: Spark Streaming Specify Kafka Partition

2015-12-03 Thread Alan Braithwaite
…/spark/streaming/kafka/KafkaUtils.scala#L395-L423 Thanks, - Alan. On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger <c...@koeninger.org> wrote: > I actually haven't tried that, since I tend to do the offset lookups if necessary. > It's possible that it will work, try it and l…

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Alan Braithwaite
Neat, thanks. If I specify something like -1 as the offset, will it consume from the latest offset or do I have to instrument that manually? - Alan On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger wrote: > Yes, there is a version of createDirectStream that lets you specify >

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
Yes, there is a version of createDirectStream that lets you specify fromOffsets: Map[TopicAndPartition, Long] On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite wrote: > Is there any mechanism in the kafka streaming source to specify the exact > partition id that we want a

Re: Spark Streaming Specify Kafka Partition

2015-12-01 Thread Cody Koeninger
I actually haven't tried that, since I tend to do the offset lookups if necessary. It's possible that it will work, try it and let me know. Be aware that if you're doing a count() or take() operation directly on the rdd it'll definitely give you the wrong result if you're using -1 for one of the

Spark Streaming Specify Kafka Partition

2015-11-30 Thread Alan Braithwaite
Is there any mechanism in the kafka streaming source to specify the exact partition id that we want a streaming job to consume from? If not, is there a workaround besides writing our own custom receiver? Thanks, - Alan
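
The variant Cody refers to above, sketched against the 0-8 API; topic, partition, offset, and broker are placeholders, and an existing StreamingContext `ssc` is assumed:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Consume only partition 3 of "mytopic", starting at offset 500.
    val fromOffsets = Map(TopicAndPartition("mytopic", 3) -> 500L)

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      Map("metadata.broker.list" -> "broker:9092"),
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))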

Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi All, Just wanted to find out if there are any benefits to installing kafka brokers and spark nodes on the same machines. Is it possible for spark to pull data from kafka locally, i.e. when the broker or partition is on the same machine as the node? Thanks, Ashish
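
For what it's worth, the later 0-10 integration added a location strategy aimed at exactly this co-located layout; a one-line sketch:

    import org.apache.spark.streaming.kafka010.LocationStrategies

    // Only sensible when executors run on the same hosts as the brokers:
    // each partition's consumer is then scheduled on that partition's leader.
    val locationStrategy = LocationStrategies.PreferBrokers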
