Re: Spark streaming multiple kafka topic doesn't work at-least-once
If you haven't looked at the offset ranges in the logs for the time period in question, I'd start there.

On Jan 24, 2017 2:51 PM, "Hakan İlter" wrote:
> Sorry for the misunderstanding. When I said that, I meant there is no lag
> in the consumer. Kafka Manager shows each consumer's coverage and lag
> status.
Re: Spark streaming multiple kafka topic doesn't work at-least-once
Sorry for the misunderstanding. When I said that, I meant there is no lag in the consumer. Kafka Manager shows each consumer's coverage and lag status.

On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger wrote:
> When you said "I check the offset ranges from Kafka Manager and don't see
> any significant deltas", what were you comparing it against? The offset
> ranges printed in the Spark logs?
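For reference, the "lag" that Kafka Manager reports per consumer is just the difference between each partition's log-end offset and the consumer's last committed offset. A minimal sketch of that computation, in plain Python with invented offset numbers (the dict shapes are an illustration, not a Kafka API):

```python
# Sketch: compute per-partition consumer lag the way a monitoring tool would.
# log_end and committed map (topic, partition) -> offset; values are made up.

def consumer_lag(log_end, committed):
    """Return {(topic, partition): lag} for every partition we know about."""
    return {tp: log_end[tp] - committed.get(tp, 0) for tp in log_end}

log_end   = {("events", 0): 1000, ("events", 1): 1000, ("clicks", 0): 500}
committed = {("events", 0): 1000, ("events", 1): 640,  ("clicks", 0): 500}

lag = consumer_lag(log_end, committed)
behind = {tp: n for tp, n in lag.items() if n > 0}
print(behind)  # only ("events", 1) is behind, by 360 messages
```

Note that a job can show zero lag here and still be losing data downstream (e.g. in the sink), which is why the thread also compares record counts.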
Re: Spark streaming multiple kafka topic doesn't work at-least-once
When you said "I check the offset ranges from Kafka Manager and don't see any significant deltas", what were you comparing it against? The offset ranges printed in the Spark logs?

On Tue, Jan 24, 2017 at 2:11 PM, Hakan İlter wrote:
> First of all, I can see both the "Input Rate" on the Spark job's
> statistics page and the Kafka producer messages/sec in Kafka Manager. The
> numbers are different when I have the problem; normally they are very
> close.
Re: Spark streaming multiple kafka topic doesn't work at-least-once
First of all, I can see both the "Input Rate" on the Spark job's statistics page and the Kafka producer messages/sec in Kafka Manager. The numbers are different when I have the problem; normally they are very close.

Besides, the job is an ETL job: it writes the results to Elasticsearch, and another legacy app writes the same results to a database. There is a huge difference between the DB and ES counts, and I know how many records we process daily.

Everything works fine if I run a separate job instance for each topic.

On Tue, Jan 24, 2017 at 5:26 PM, Cody Koeninger wrote:
> I'm confused. If you don't see any difference between the offsets the job
> is processing and the offsets available in Kafka, then how do you know
> it's processing less than all of the data?
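The comparison described above — Spark's reported "Input Rate" vs. the producer rate shown in Kafka Manager — amounts to a simple ratio check. A hedged sketch (the 700 and 350 req/sec figures are the ones from this thread; the function itself is illustrative, not part of any Spark or Kafka API):

```python
# Sketch: quantify the mismatch between the job's input rate and the rate
# at which messages actually arrive on the topics.

def processing_ratio(spark_input_rate, kafka_producer_rate):
    """Fraction of incoming traffic the job appears to be consuming."""
    if kafka_producer_rate == 0:
        return 1.0  # nothing arriving, nothing to fall behind on
    return spark_input_rate / kafka_producer_rate

# Healthy: both around 700 req/sec. Degraded (as reported): Spark shows
# ~350 req/sec while Kafka still receives ~700 req/sec.
ratio = processing_ratio(350, 700)
print(f"consuming {ratio:.0%} of incoming traffic")  # consuming 50% of incoming traffic
```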
Re: Spark streaming multiple kafka topic doesn't work at-least-once
I'm confused. If you don't see any difference between the offsets the job is processing and the offsets available in Kafka, then how do you know it's processing less than all of the data?

On Tue, Jan 24, 2017 at 12:35 AM, Hakan İlter wrote:
> I'm using DirectStream as one stream for all topics. I check the offset
> ranges from Kafka Manager and don't see any significant deltas.
Re: Spark streaming multiple kafka topic doesn't work at-least-once
I'm using DirectStream as one stream for all topics. I check the offset ranges from Kafka Manager and don't see any significant deltas.

On Tue, Jan 24, 2017 at 4:42 AM, Cody Koeninger wrote:
> Are you using a receiver-based or a direct stream? Are you doing 1 stream
> per topic, or 1 stream for all topics?
Re: Spark streaming multiple kafka topic doesn't work at-least-once
Are you using a receiver-based or a direct stream?

Are you doing 1 stream per topic, or 1 stream for all topics?

If you're using the direct stream, the actual topics and offset ranges should be visible in the logs, so you should be able to see more detail about what's happening (e.g. all topics are still being processed but offsets are significantly behind, vs. only certain topics being processed but keeping up with the latest offsets).

On Mon, Jan 23, 2017 at 3:14 PM, hakanilter wrote:
> Hi everyone,
>
> I have a Spark (1.6.0-cdh5.7.1) streaming job which receives data from
> multiple Kafka topics. After starting the job, everything works fine at
> first (around 700 req/sec), but after a while (a couple of days or a
> week) it starts processing only part of the data (around 350 req/sec).
> When I check the Kafka topics, I can see that 700 req/sec are still
> coming into the topics. I don't see any errors, exceptions or any other
> problems. The job works fine when I start the same code with just a
> single Kafka topic.
>
> Do you have any idea or a clue to understand the problem?
>
> Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-multiple-kafka-topic-doesn-t-work-at-least-once-tp28334.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
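The direct stream does expose per-batch offset ranges that can be logged for exactly this kind of comparison: in Spark 1.6 the usual pattern is to cast each RDD to `HasOffsetRanges` inside `foreachRDD` (Scala), or to call `rdd.offsetRanges()` in a `transform` step (Python). A sketch of the log-formatting half in plain Python — the `OffsetRange` namedtuple here merely mirrors the fields Spark's class exposes, and the streaming wiring is shown only in comments as an assumption about the job's structure:

```python
from collections import namedtuple

# Stand-in for org.apache.spark.streaming.kafka.OffsetRange (same fields).
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")

def describe_ranges(ranges):
    """One log line per topic-partition for the current batch."""
    return ["%s-%d: %d -> %d (%d messages)"
            % (r.topic, r.partition, r.fromOffset, r.untilOffset,
               r.untilOffset - r.fromOffset)
            for r in ranges]

# Inside the streaming job it would be wired up roughly like:
#   def store(rdd):
#       for line in describe_ranges(rdd.offsetRanges()):
#           log.info(line)
#       return rdd
#   directKafkaStream.transform(store).foreachRDD(process)

batch = [OffsetRange("events", 0, 100, 250), OffsetRange("clicks", 0, 40, 40)]
for line in describe_ranges(batch):
    print(line)
```

Comparing these logged ranges per topic against the latest offsets in Kafka would distinguish the two failure modes described above: all topics falling behind evenly vs. only some topics being consumed.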