Re: Missing data in Kafka Consumer

2016-05-05 Thread Nirav Shah
I had the same issue while using Kafka with Storm. Then I found that the
number of Storm spout instances should not be greater than the number of
partitions.

If you go above that, the counts stop matching. Maybe you can check for
something similar in Spark.
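
I don't know the Spark API in detail, but something along these lines might
help you check (a rough, untested sketch; it assumes a hypothetical direct
Kafka stream named "stream", since with the direct approach each Kafka
partition maps to exactly one Spark partition):

    // Rough sketch only: log partition and record counts for every batch so
    // they can be compared with what the producer wrote to the topic.
    stream.foreachRDD { rdd =>
      println(s"partitions = ${rdd.partitions.length}, records = ${rdd.count()}")
    }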


Regards,

Nirav
On May 5, 2016 9:48 PM, "Jerry"  wrote:

> Hi,
>
> Can anybody give me an idea why data is lost on the Kafka consumer side? I
> use Kafka 0.8.2 and Spark Streaming 1.5.2. Sometimes I find that I do not
> receive the same number of records that the Kafka producer sent. For
> example, I sent 1000 records to the Kafka broker via the producer and
> confirmed the same count in the broker, but when I checked HDFS or
> Cassandra the count was only 363. The data is not always lost, just
> sometimes... That's weird and annoying.
> Can anybody suggest some reasons?
>
> Thanks!
> Jerry


Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
Does that code even compile? I'm assuming eventLogJson.foreach is supposed
to be eventLogJson.foreachRDD?
I'm also confused as to why you're repartitioning to 1 partition.

Is your streaming job lagging behind (especially given that you're
basically single-threading it by repartitioning to 1 partition)?

Have you looked for any error logs or failed tasks during the time you
noticed missing messages?

Have you verified that you aren't attempting to overwrite hdfs paths?
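
For what it's worth, here's a rough sketch of the direction I'd try (not a
drop-in fix; it reuses the eventLogJson and hdfs_path names from your code
below): use foreachRDD, derive the output path from the batch time so two
batches can never collide on the same HDFS directory, and drop the
repartition(1).

    // Sketch only: one output directory per batch, named after the batch
    // time, so saveAsTextFile never writes to a path that already exists.
    eventLogJson.foreachRDD { (rdd, batchTime) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"$hdfs_path/eventlogs/${batchTime.milliseconds}")
      }
    }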


On Thu, May 5, 2016 at 2:09 PM, Jerry Wong  wrote:
> Hi Cody,
>
> Thank you for the quick response to my question. I have pasted the main part
> of the code below:
>
> val sparkConf = new SparkConf().setAppName("KafkaSparkConsumer")
>
> sparkConf.set("spark.cassandra.connection.host", "...")
> sparkConf.set("spark.broadcast.factory",
>   "org.apache.spark.broadcast.HttpBroadcastFactory")
> sparkConf.set("spark.cores.max", args(0))
> sparkConf.set("spark.executor.memory", args(1))
>
> val kafka_broker = args(2)
> val kafka_topic = args(3)
> val hdfs_path = args(4)
>
> val ssc = new StreamingContext(sparkConf, 2)
> val topicsSet = Set[String](kafka_topic)
> val kafkaParams = Map[String, String]("metadata.broker.list" -> kafka_broker)
> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
>   StringDecoder](ssc, kafkaParams, topicsSet)
>
> val lines = messages.repartition(1).map { case (w, c) => c }
> val eventLogJson = lines.filter(line => line.contains("eventType"))
>
> val eventlog = eventLogJson.foreach(json => {
>   if (!json.isEmpty()) {
>     json.saveAsTextFile(hdfs_path + "/eventlogs/" + getTimeFormatToFile())
>   }
> })
>
> ssc.start()
> ssc.awaitTermination()
> }
>
> def getTimeFormatToFile(): String = {
>   val dateFormat = new SimpleDateFormat("_MM_dd_HH_mm_ss")
>   val dt = new Date()
>   val cg = new GregorianCalendar()
>   cg.setTime(dt)
>   return dateFormat.format(cg.getTime())
> }
>
> Do you need any other information?
>
> Thanks!
>
> On Thu, May 5, 2016 at 12:34 PM, Cody Koeninger  wrote:
>>
>> That's not much information to go on.  Any relevant code sample or log
>> messages?
>>



Re: Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi David,

Thank you for your response.
Before inserting into Cassandra, I checked and the data is already missing in
HDFS (my second step is to load the data from HDFS and then insert it into
Cassandra).
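
In the meantime I will try logging the Kafka offset ranges of every batch, so
I can compare how many messages the direct stream actually consumed with what
ends up in HDFS. A rough sketch (using the "messages" direct stream from my
job, before any repartition):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    // Sketch only: print the offset range (and message count) consumed from
    // each Kafka partition for every batch.
    messages.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} partition ${r.partition}: " +
          s"${r.fromOffset} -> ${r.untilOffset} (${r.count()} messages)")
      }
    }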

Can you send me a link describing this bug in 0.8.2?

Thank you!
Jerry

On Thu, May 5, 2016 at 12:38 PM, david.lewis [via Apache Spark User List] <
ml-node+s1001560n26888...@n3.nabble.com> wrote:

> It's possible Kafka is throwing an exception and erroneously returning
> acks (there is a known bug in 0.8.2 that I hit when the hard disk that
> was keeping the log files and holding the temporary snappy library was
> full).
> It's also possible that your messages are not unique when they are put
> into Cassandra. Are all of your messages unique in their primary keys in
> your Cassandra table?
>
> On Thu, May 5, 2016 at 10:18 AM, Jerry [via Apache Spark User List] wrote:
>
>> Hi,
>>
>> Can anybody give me an idea why data is lost on the Kafka consumer side?
>> I use Kafka 0.8.2 and Spark Streaming 1.5.2. Sometimes I find that I do
>> not receive the same number of records that the Kafka producer sent. For
>> example, I sent 1000 records to the Kafka broker via the producer and
>> confirmed the same count in the broker, but when I checked HDFS or
>> Cassandra the count was only 363. The data is not always lost, just
>> sometimes... That's weird and annoying.
>> Can anybody suggest some reasons?
>>
>> Thanks!
>> Jerry
>
> --
> -David Lewis
>
> *Blyncsy, Inc.* | www.Blyncsy.com





Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
That's not much information to go on.  Any relevant code sample or log messages?

On Thu, May 5, 2016 at 11:18 AM, Jerry  wrote:
> Hi,
>
> Can anybody give me an idea why data is lost on the Kafka consumer side? I
> use Kafka 0.8.2 and Spark Streaming 1.5.2. Sometimes I find that I do not
> receive the same number of records that the Kafka producer sent. For
> example, I sent 1000 records to the Kafka broker via the producer and
> confirmed the same count in the broker, but when I checked HDFS or
> Cassandra the count was only 363. The data is not always lost, just
> sometimes... That's weird and annoying.
> Can anybody suggest some reasons?
>
> Thanks!
> Jerry
>



Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi,

Can anybody give me an idea why data is lost on the Kafka consumer side? I
use Kafka 0.8.2 and Spark Streaming 1.5.2. Sometimes I find that I do not
receive the same number of records that the Kafka producer sent. For example,
I sent 1000 records to the Kafka broker via the producer and confirmed the
same count in the broker, but when I checked HDFS or Cassandra the count was
only 363. The data is not always lost, just sometimes... That's weird and
annoying.
Can anybody suggest some reasons?

Thanks!
Jerry  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Missing-data-in-Kafka-Consumer-tp26887.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org