Re: Official Kafka Disaster Recovery is insufficient - Suggestions needed

2018-09-07 Thread Manjunath N
Henning,

> It is my understanding that you produce messages to Kafka partitions using 
> the normal producer API and then subsequently ETL them to some cold storage 
> using one or more consumers, i.e. the cold storage is eventually consistent 
> with Kafka!? 

In some of the deployments I have worked on, we keep unprocessed raw data for 3 
days for DR purposes, even before we process it through ETL.

I am not sure how your data is structured, but here is one way you could work 
around this scenario if you have a timestamp for each record in the raw data.

Each record produced to a Kafka broker has a CREATE_TIME timestamp. When 
consumers read, process, and write to a sink, then irrespective of which offset 
we managed to commit in Kafka before a failure, we can always find the last 
record that was processed and successfully written to the sink. Before 
restarting a consumer, we can run an identity consumer whose only job is to 
identify the offset of that last successfully written record: starting from the 
last committed offset, it retrieves records and checks for the matching 
timestamp. Once that offset is identified, we seek to it in the 
ConsumerRebalanceListener's onPartitionsAssigned method and resume processing 
from where we crashed.

I haven't tried this; I am just sharing what came to mind.
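Very roughly, and purely as an untested sketch, the identity consumer and the 
seek in onPartitionsAssigned could look something like the code below. The 
topic name, group id, and the way lastSinkTimestamp is obtained are only 
placeholders; in practice that timestamp has to be looked up from your sink.

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class IdentityConsumerSketch {

    // Scan forward from the last committed offset until we hit the record whose
    // CREATE_TIME equals the timestamp of the last record found in the sink, and
    // return the offset to resume from (one past that record).
    static long findRestartOffset(Consumer<byte[], byte[]> consumer,
                                  TopicPartition tp,
                                  long lastSinkTimestamp) {
        OffsetAndMetadata committed = consumer.committed(tp);
        long start = committed != null ? committed.offset() : 0L;
        consumer.assign(Collections.singletonList(tp));
        consumer.seek(tp, start);
        while (true) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            if (records.isEmpty()) {
                // Reached the end (or got an empty poll) without a match; a real
                // implementation would retry a few times before giving up.
                return start;
            }
            for (ConsumerRecord<byte[], byte[]> r : records.records(tp)) {
                if (r.timestamp() == lastSinkTimestamp) {
                    return r.offset() + 1; // resume right after the last record in the sink
                }
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-processing-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("raw-data", 0);
        long lastSinkTimestamp = 0L; // placeholder: look this up from the sink

        // Run the identity consumer once, before restarting the real consumer.
        Map<TopicPartition, Long> restartOffsets = new HashMap<>();
        try (Consumer<byte[], byte[]> identity = new KafkaConsumer<>(props)) {
            restartOffsets.put(tp, findRestartOffset(identity, tp, lastSinkTimestamp));
        }

        // The real consumer seeks to the discovered offset when the partition is
        // assigned, so processing resumes from where we crashed.
        Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("raw-data"), new ConsumerRebalanceListener() {
            @Override public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            @Override public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                for (TopicPartition p : partitions) {
                    Long offset = restartOffsets.get(p);
                    if (offset != null) {
                        consumer.seek(p, offset);
                    }
                }
            }
        });
        // ... normal poll / process / commit loop follows here.
    }
}

Keep in mind CREATE_TIME is producer-supplied, so duplicate or out-of-order 
timestamps near the crash point would need extra care.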

For bad messages too, you should be able to restore the correct state by 
re-reading the raw data, provided Kafka is used as the event store.
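As another untested sketch (the topic name and the rebuildState() helper are 
just placeholders for whatever re-derives your state), replaying the raw topic 
from the beginning could look like this:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class ReplayFromRawData {

    // Placeholder for whatever logic re-derives correct state from a raw event.
    static void rebuildState(ConsumerRecord<byte[], byte[]> record) { }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("raw-data", 0);
        try (Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            // Replay every raw event up to the current end of the partition.
            while (consumer.position(tp) < end) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    rebuildState(record);
                }
            }
        }
    }
}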

Manjunath

> On Sep 4, 2018, at 7:40 PM, Henning Røigaard-Petersen  wrote:
> 
> Thank you for your answer Ryanne. It’s always a pleasure to be presented with 
> such unambiguous advice. It really gives you something to work with :). 
> 
> To any other readers, I am very interested in hearing of other approaches to 
> DR in Kafka. 
> 
> Ryanne, I agree with your statement as to the probability difference of the 
> different DR scenarios, and I get in principle how your approach would allow 
> us to recover from “bad” messages, but we must of course ensure that we have 
> counter measures for all the scenarios. 
> 
> To that end, I have a couple of questions to your approach to DR.
> 
> Q1) 
> It is my understanding that you produce messages to Kafka partitions using 
> the normal producer API and then subsequently ETL them to some cold storage 
> using one or more consumers, i.e. the cold storage is eventually consistent 
> with Kafka!? 
> 
> If this is true, isn’t your approach prone to the same loss-of-tail issues as 
> regular multi cluster replication in case of total ISR loss? That is, we may 
> end up with an inconsistent cold storage, because downstream messages may be 
> backed up before the corresponding upstream messages are backed up?
> 
> I guess some ways around this would be to have only one partition (not 
> feasible) or to store state changes directly to other storage and ETL those 
> changes back to Kafka for downstream consumption. However, I believe that is 
> not what you are suggesting.
> 
> Q2) 
> I am unsure how your approach should work in practice, concerning the likely 
> disaster scenario of bad messages. 
> 
> Assume a bad message is produced and ETL’ed to the cold storage. 
> 
> As an isolated message, we could simply wipe the Kafka partition and 
> reproduce all relevant messages or compact the bad message with a newer 
> version. This all makes sense. 
> However, more likely, it will not be an isolated bad message, but rather a 
> plethora of downstream consumers will process it and in turn produce derived 
> bad messages, which are further processed downstream. This could result in an 
> enormous amount of bad messages and bad state in cold storage. 
> 
> How would you recover in this case? 
> 
> It might be possible to iterate through the entirety of the state to detect 
> bad messages, but updating with the correct data seems impossible.
> 
> I guess one very crude fallback solution may be to identify the root bad 
> message, and somehow restore to a previous consistent state for the entire 
> system. This however, requires some global message property across the entire 
> system. You mention Timestamps, but traditionally these are intrinsically 
> unreliable, especially in a distributed environment, and will most likely 
> lead to loss of messages with timestamps close to the root bad message.
> 
> Q3) 
> Does the statement “Don't rely on unlimited retention in Kafka” imply some 
> flaw in the implementation, or is it simply a reference to the advice of not 
> using Kafka as Source of Truth due to the DR issues?
> 
> Thank you for your time
> 
> Henning Røigaard-Petersen
> 
> -Original Message-
> From: Ryanne Dolan  
> Sent: 3. september 2018 20:27
> To: users@kafka.apache.org
> Subject: Re: Official Kafka Disaster Recovery is insufficient - Suggestions 
> needed
> 
> Sorry to have misspelled your name Henning.
> 
> On Mon, Sep 3, 2018, 1:26 PM Ryanne Dolan  wrote:
> 
>> Hanning,
>> 
>> In 

Re: Requesting for JIRA permission as contributor/committer

2018-08-10 Thread Manjunath N
Hi Guozhang,

Could you please add manj...@gmail.com too, so that I can take up some open 
issues to contribute.

Thanks
Manjunath

> On Aug 10, 2018, at 9:13 PM, Guozhang Wang  wrote:
> 
> It's done.
> 
> Cheers,
> Guozhang
> 
> On Fri, Aug 10, 2018 at 6:34 AM, M. Manna  wrote:
> 
>> manmedia - (email - manme...@gmail.com)
>> 
>> On 10 August 2018 at 14:34, Guozhang Wang  wrote:
>> 
>>> Hello,
>>> 
>>> What is your apache id?
>>> 
>>> 
>>> Guozhang
>>> 
>>> On Fri, Aug 10, 2018 at 12:50 AM, M. Manna  wrote:
>>> 
 Hello,
 
 Is it possible to get permission for assigning JIRA tickets. Just
>> wanted
>>> to
 start with some minor bugs/features.
 
 Regards,
 
>>> 
>>> 
>>> 
>>> --
>>> -- Guozhang
>>> 
>> 
> 
> 
> 
> -- 
> -- Guozhang



Re: Zookeeper logging “exception causing close of session 0x0” infinitely in logs

2018-08-06 Thread Manjunath N
Check this thread; I am not sure if it is the same case. You could also enable 
trace logging and see whether you can find more information.

http://zookeeper-user.578899.n2.nabble.com/chatty-error-Exception-causing-close-of-session-0x0-due-to-java-io-IOException-Len-error-1835955314-td7198695.html


Also, if this host is the ZooKeeper leader, you can restart it and observe 
whether the same behavior reproduces once leadership moves to another host.


> On Aug 6, 2018, at 3:46 PM, Shantanu Deshmukh  wrote:
> 
> We have a cluster of 3 kafka+zookeeper. Only on one of our zookeeper
> servers we are seeing these logs infinitely getting written in
> zookeeper.out log file
> 
> WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCxn@1033] -
> Exception causing close of session 0x0 due to java.io.Exception
> INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCxn@1033] -
> Closed socket connection from /10.189.177.31:65429 (no session
> established for client)
> 
> I have no idea what this server 10.189.177.31 is. No Kafka consumer is
> running on this machine.
> 
> No changes to zookeeper were made. Due to a disk full issue the cluster had
> crashed. So we started all brokers and zookeepers. Everything went well
> except for this one. What can be done for this case? As zk is logging so
> heavily it will fill up the disk again with just these logs. Please help



Re: Apache Kafka Process showing high CPU (100 to 200+) usage in Linux when Idle

2018-08-05 Thread Manjunath N
After you deleted a topic, was it a clean delete? Did you verify in ZooKeeper 
and in the Kafka log directories? If not, you may need to do some cleanup if 
the Kafka log directories and ZooKeeper are inconsistent.

Did you try moving the replica assignment for this topic to different machines 
to see whether it behaves the same way there?

Check the number of open file handles on all replica machines for this topic, 
both before you start writing to it and after you kick off the writer/producer.

Check how many log files are being created for the partition segments in the 
Kafka log directories.

Is there anything in the log files that would help trace back this behavior? 
If you could check and share any errors or warnings, it would help.


> On Aug 5, 2018, at 4:13 AM, Ted Yu  wrote:
> 
> bq. only one specific node is showing this issue
> 
> Is the controller running on this node? Updating the metrics is expensive.
> 
> Cheers
> 
> On Sat, Aug 4, 2018 at 3:00 PM Abhijith Sreenivasan <
> abhijithonl...@gmail.com> wrote:
> 
>> Hello
>> 
>> We are seeing high CPU usage for the Kafka process. I am using 0.11
>> version. It has 5 topics, of which 1 was created newly. We attempted to
>> publish a message to this new topic, which did not show up in the consumer,
>> but there were no errors on the publisher end. Not sure why the message did
>> not show up in the consumer.
>> 
>> This ran for a couple of days (30K messages) when we noticed 100%+ CPU
>> usage. Tried deleting the topic (config is enabled), it was marked for
>> deletion but after which usage rose to below levels 240%+. We restarted the
>> process many times and disabled the publisher/producer but no difference.
>> After some time (1 or 2 hours) we are getting a "Too many open files" error
>> and process is shutting down.
>> 
>> We have 3 nodes with Kafka and 3 other nodes running with ZK, but only one
>> specific node is showing this issue. (where new topic partition is
>> present).
>> 
>> Still debugging and this is a prod environment.. please help!
>> 
>> Thanks,
>> Abhi
>> 
>> top - 17:47:43 up 289 days, 18:54,  2 users,  load average: 2.65, 2.72,
>> 2.52
>> Tasks: 144 total,   1 running, 143 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 37.5 us, 19.4 sy,  0.0 ni, 37.0 id,  0.0 wa,  0.0 hi,  5.4 si,
>> 0.7 st
>> KiB Mem : 16266464 total,  1431916 free,  5769976 used,  9064572 buff/cache
>> KiB Swap:0 total,0 free,0 used.  9230548 avail Mem
>> 
>>  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
>> 32058 root  20   0 5898348 1.078g  15548 S 253.0  6.9  99:03.68 java
>>   10 root  20   0   0  0  0 S   0.3  0.0 921:42.77
>> rcu_sche
>> 



Re: A Question about stopping brokers

2018-08-03 Thread Manjunath N
Hi Kyle,

When the command below is executed to get the PID of the Kafka process, the CMD 
column output is truncated and does not show the process name kafka.Kafka. 
Hence, the returned PID is blank and the script reports “No kafka server to 
stop”. I faced this issue in version 1.0.0 but not in 2.0.0. 

ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}'

Thanks
Manjunath

> On Aug 3, 2018, at 3:40 PM, Kyle.Hu  wrote:
> 
> Hi, ALL :
>  When I execute kafka-server-stop.sh in the bin directory, I get the error “No 
> kafka server to stop” even though my brokers are running well. For a long time I 
> have not known how to stop them, so I always kill the process when I want to stop 
> one. What is the recommended way to stop it? Looking forward to your reply.



Replica not coming up for a partition after restarting broker | Error NOT_LEADER_FOR_PARTITION

2018-08-01 Thread Manjunath N
Hi,

Problem: A replica for a partition is not coming up. Steps below:

I have a ZooKeeper cluster with three machines and a Kafka cluster with 3 
brokers on the same machines. 

I created a topic test as below.
Operation: kafka-topics.sh --zookeeper rh3:2181/kafka --create --topic test 
--replication-factor 2 --partitions 1

Step 1

Status: kafka-topics.sh --zookeeper rh3:2181/kafka --describe
Topic:test    PartitionCount:1    ReplicationFactor:2    Configs:
    Topic: test    Partition: 0    Leader: 5    Replicas: 5,4    Isr: 5,4

Later I altered the topic to add a new partition: 

Step 2:
Operation: kafka-topics.sh --zookeeper rh3:2181/kafka --alter --topic test 
--partitions 2

Status: kafka-topics.sh --zookeeper rh3:2181/kafka --describe
Topic:test    PartitionCount:2    ReplicationFactor:2    Configs:
    Topic: test    Partition: 0    Leader: 5    Replicas: 5,4    Isr: 5,4
    Topic: test    Partition: 1    Leader: 3    Replicas: 3,4    Isr: 3,4

Step 3: Now I shut down broker 4.

Status: kafka-topics.sh --zookeeper rh3:2181/kafka --describe
Topic:test    PartitionCount:2    ReplicationFactor:2    Configs:
    Topic: test    Partition: 0    Leader: 5    Replicas: 5,4    Isr: 5
    Topic: test    Partition: 1    Leader: 3    Replicas: 3,4    Isr: 3

Step 4: Later, when I got broker 4 up and running again, the replica for 
partition 1 did not sync up. Here is the status.

Status: kafka-topics.sh --zookeeper rh3:2181/kafka --describe
Topic:test    PartitionCount:2    ReplicationFactor:2    Configs:
    Topic: test    Partition: 0    Leader: 5    Replicas: 5,4    Isr: 5,4
    Topic: test    Partition: 1    Leader: 3    Replicas: 3,4    Isr: 3

The log on broker 4 shows this error. Could you please let me know what can be 
done to restore the replica for partition 1?

[2018-08-01 06:02:18,685] INFO [ReplicaFetcher replicaId=4, leaderId=3, 
fetcherId=0] Retrying leaderEpoch request for partition test-1 as the leader 
reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)
[2018-08-01 06:02:19,690] INFO [ReplicaFetcher replicaId=4, leaderId=3, 
fetcherId=0] Retrying leaderEpoch request for partition test-1 as the leader 
reported an error: NOT_LEADER_FOR_PARTITION (kafka.server.ReplicaFetcherThread)

Thanks
Manjunath