Re: Missing data in Kafka Consumer
I had the same issue while using Kafka with Storm. Then I found that the number of Storm spout instances should not be greater than the number of partitions; when I exceeded that, the counts did not match. Maybe you can check something similar for Spark.

Regards,
Nirav

On May 5, 2016 9:48 PM, "Jerry" wrote:
> Hi,
>
> Can anybody give me an idea why data is lost on the Kafka consumer side? I
> use Kafka 0.8.2 and Spark (Streaming) 1.5.2. Sometimes I found I did not
> receive the same number of records that the Kafka producer sent. For
> example, I sent 1000 records to the Kafka broker via the Kafka producer and
> confirmed the same number in the broker. But when I checked either HDFS or
> Cassandra, the number was just 363. The data is not always lost, just
> sometimes... That's weird and annoying to me.
> Can anybody give me some reasons?
>
> Thanks!
> Jerry
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
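Nirav's observation generalizes beyond Storm: in most Kafka consumption models, each partition is read by at most one consumer instance in a group, so any instances beyond the partition count sit idle and per-instance counts stop lining up with what was produced. A toy round-robin assignment (hypothetical code, not Storm's or Spark's actual scheduler) illustrates the effect:

```scala
// Round-robin assignment of topic partitions to consumer instances.
// Instances beyond the partition count end up owning nothing.
def assign(numPartitions: Int, numInstances: Int): Map[Int, Seq[Int]] = {
  // Start every instance with an empty assignment...
  val idle: Map[Int, Seq[Int]] =
    (0 until numInstances).map(i => i -> Seq.empty[Int]).toMap
  // ...then deal the partitions out round-robin.
  (0 until numPartitions).foldLeft(idle) { (acc, p) =>
    val owner = p % numInstances
    acc.updated(owner, acc(owner) :+ p)
  }
}

// With 5 instances reading a 3-partition topic, two instances get nothing:
val a = assign(numPartitions = 3, numInstances = 5)
println(a(3), a(4)) // both empty
```

This is why making the instance count exceed the partition count buys nothing and can make per-instance bookkeeping misleading.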
Re: Missing data in Kafka Consumer
Does that code even compile? I'm assuming eventLogJson.foreach is supposed to be eventLogJson.foreachRDD? I'm also confused as to why you're repartitioning to 1 partition. Is your streaming job lagging behind (especially given that you're basically single-threading it by repartitioning to 1 partition)? Have you looked for any error logs or failed tasks during the time you noticed missing messages? Have you verified that you aren't attempting to overwrite HDFS paths?

On Thu, May 5, 2016 at 2:09 PM, Jerry Wong wrote:
> Hi Cody,
>
> Thank you for the quick response to my question. I've pasted the main part
> of the code:
>
>     val sparkConf = new SparkConf().setAppName("KafkaSparkConsumer")
>     sparkConf.set("spark.cassandra.connection.host", "...")
>     sparkConf.set("spark.broadcast.factory",
>       "org.apache.spark.broadcast.HttpBroadcastFactory")
>     sparkConf.set("spark.cores.max", args(0))
>     sparkConf.set("spark.executor.memory", args(1))
>     val kafka_broker = args(2)
>     val kafka_topic = args(3)
>     val hdfs_path = args(4)
>     val ssc = new StreamingContext(sparkConf, 2)
>     val topicsSet = Set[String](kafka_topic)
>     val kafkaParams = Map[String, String]("metadata.broker.list" -> kafka_broker)
>     val messages = KafkaUtils.createDirectStream[String, String,
>       StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
>     val lines = messages.repartition(1).map({ case (w, c) => c })
>     val eventLogJson = lines.filter(line => line.contains("eventType"))
>
>     val eventlog = eventLogJson.foreach(json => {
>       if (!json.isEmpty()) {
>         json.saveAsTextFile(hdfs_path + "/eventlogs/" + getTimeFormatToFile())
>       }
>     })
>     ssc.start()
>     ssc.awaitTermination()
>   }
>
>   def getTimeFormatToFile(): String = {
>     val dateFormat = new SimpleDateFormat("_MM_dd_HH_mm_ss")
>     val dt = new Date()
>     val cg = new GregorianCalendar()
>     cg.setTime(dt)
>     return dateFormat.format(cg.getTime())
>   }
>
> Is any more information needed?
>
> Thanks!
>
> On Thu, May 5, 2016 at 12:34 PM, Cody Koeninger wrote:
>> That's not much information to go on. Any relevant code sample or log
>> messages?
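For reference, a sketch of the shape Cody is pointing at: with the Spark 1.5 / Kafka 0.8 direct stream, the batch interval must be a Duration (Seconds(2), not a bare 2), and per-batch driver-side work goes through foreachRDD. This is an untested outline, assuming the same kafka_broker, kafka_topic, and hdfs_path values as Jerry's snippet, not a verified drop-in fix:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("KafkaSparkConsumer")
// The second argument is a Duration; a bare Int will not compile.
val ssc = new StreamingContext(sparkConf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> kafka_broker)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, Set(kafka_topic))

// foreachRDD (not foreach) runs once per batch, giving you the batch's RDD.
messages.map(_._2)
  .filter(_.contains("eventType"))
  .foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // A second-granularity timestamped path risks two batches resolving to
      // the same directory; folding in something unique per batch (e.g. the
      // batch time or Kafka offsets) would make overwrites visible.
      rdd.saveAsTextFile(hdfs_path + "/eventlogs/" + getTimeFormatToFile())
    }
  }

ssc.start()
ssc.awaitTermination()
```

Dropping the repartition(1) would also let each Kafka partition be written in parallel instead of funneling every batch through a single task.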
Re: Missing data in Kafka Consumer
Hi David,

Thank you for your response. Before inserting into Cassandra, I had already checked that the data was missing in HDFS (my second step is to load the data from HDFS and then insert it into Cassandra). Can you send me a link about this bug in 0.8.2?

Thank you!
Jerry

On Thu, May 5, 2016 at 12:38 PM, david.lewis [via Apache Spark User List] wrote:
> It's possible Kafka is throwing an exception and erroneously returning
> acks (there is a known bug in 0.8.2 that I encountered when the hard disk
> that was keeping log files and holding the temporary snappy library was
> full).
> It's also possible that your messages are not unique when they are put
> into Cassandra. Are all of your messages unique in their primary keys in
> your Cassandra table?
>
> -David Lewis
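David's second point is easy to reproduce in miniature: Cassandra INSERTs are upserts, so rows sharing a primary key overwrite each other rather than accumulate. A plain Scala Map behaves the same way; this is a toy model, not the Cassandra driver, and the 363 is chosen deliberately to echo the numbers in the thread:

```scala
// Toy model of upsert semantics: a Map keyed like a primary key keeps only
// the last write per key, just as a Cassandra INSERT with a duplicate
// primary key silently replaces the existing row instead of adding one.
val writes = (1 to 1000).map(i => (i % 363) -> s"event-$i") // only 363 distinct keys
val table = writes.toMap // later pairs overwrite earlier ones: last write wins
println(table.size) // 363, even though 1000 rows were "inserted"
```

So "sent 1000, see 363" in Cassandra is consistent with non-unique primary keys; it is only because Jerry sees the shortfall already in HDFS that this explanation is ruled out here.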
Re: Missing data in Kafka Consumer
That's not much information to go on. Any relevant code sample or log messages?
Missing data in Kafka Consumer
Hi,

Can anybody give me an idea why data is lost on the Kafka consumer side? I use Kafka 0.8.2 and Spark (Streaming) 1.5.2. Sometimes I found I did not receive the same number of records that the Kafka producer sent. For example, I sent 1000 records to the Kafka broker via the Kafka producer and confirmed the same number in the broker. But when I checked either HDFS or Cassandra, the number was just 363. The data is not always lost, just sometimes... That's weird and annoying to me.
Can anybody give me some reasons?

Thanks!
Jerry
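One way to narrow down where the records disappear, assuming the direct stream API is in use: each batch's RDD carries its Kafka offset ranges (via HasOffsetRanges), and the sum of untilOffset minus fromOffset is exactly how many records the batch pulled. Comparing that number against what lands in HDFS separates a consume problem from a write problem. The OffsetSpan case class below is a stand-in for Spark's OffsetRange, just to show the arithmetic:

```scala
// Stand-in for the fields of Spark's OffsetRange; in a real job these
// would come from rdd.asInstanceOf[HasOffsetRanges].offsetRanges inside
// a foreachRDD block.
case class OffsetSpan(topic: String, partition: Int,
                      fromOffset: Long, untilOffset: Long)

// Records consumed in a batch = offsets advanced, summed over partitions.
def consumedCount(spans: Seq[OffsetSpan]): Long =
  spans.map(s => s.untilOffset - s.fromOffset).sum

// e.g. a 2-partition topic whose batch advanced by 1000 offsets in total:
val batch = Seq(OffsetSpan("events", 0, 0L, 600L),
                OffsetSpan("events", 1, 0L, 400L))
println(consumedCount(batch)) // 1000
```

Logging this per batch would show directly whether a 363/1000 batch under-consumed from Kafka or consumed everything and then lost records on the write path.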