Re: Can anyone help me to send messages in their original order?

2018-05-25 Thread Hans Jespersen
If you create a topic with a single partition, the messages will be in order.

Alternatively, if you publish every message with the same key, they will all be
routed to the same partition and therefore stay in order, even if your topic
has more than one partition.

Either approach will work in Kafka.
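A minimal sketch of the single-partition route, reusing the addresses from the
commands quoted below (the new topic name is only an example, and replication
factor 1 is only sensible on a single-broker sandbox):

# create a topic with exactly one partition
./kafka-topics.sh --zookeeper 192.168.112.129:2181 --create \
  --topic kafka-topic2-ordered --partitions 1 --replication-factor 1

# pipe the whole file into the console producer instead of typing lines by hand
./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 \
  --topic kafka-topic2-ordered < test1.csv

For the keyed route, the console producer can also read a key from each line,
e.g. with --property parse.key=true --property key.separator=':' and every
line prefixed with the same key.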

-hans

> On May 25, 2018, at 8:56 PM, Raymond Xie  wrote:
> 
> Hello,
> 
> I just started learning Kafka and have the environment set up on my
> Hortonworks sandbox running in VMware at home.
> 
> test.csv is what I want the producer to send out:
> 
> more test1.csv | ./kafka-console-producer.sh --broker-list
> sandbox.hortonworks.com:6667 --topic kafka-topic2
> 
> 1, abc
> 2, def
> ...
> 8, vwx
> 9, zzz
> 
> What I received is all the content of test.csv, however not in its
> original order:
> 
> kafka-console-consumer.sh --zookeeper 192.168.112.129:2181 --topic
> kafka-topic2
> 
> 2, def
> 1, abc
> ...
> 9, zzz
> 8, vwx
> 
> 
> I read on Google that partitioning could be a feasible solution; however,
> my questions are:
> 
> 1. For small files like this one, do I really need to worry about
> partitioning? How small does a partition have to be to guarantee the
> sequence?
> 2. For big files, each partition could still contain multiple lines. How do
> I ensure that the lines within each partition don't get reordered on the
> consumer side?
> 
> 
> I also want to know: what is the best practice for processing large
> volumes of data through Kafka? There should be a better way than the
> console commands.
> 
> Thank you very much.
> 
> 
> 
> Sincerely yours,
> 
> Raymond


Can anyone help me to send messages in their original order?

2018-05-25 Thread Raymond Xie
Hello,

I just started learning Kafka and have the environment set up on my
Hortonworks sandbox running in VMware at home.

test.csv is what I want the producer to send out:

more test1.csv | ./kafka-console-producer.sh --broker-list
sandbox.hortonworks.com:6667 --topic kafka-topic2

1, abc
2, def
...
8, vwx
9, zzz

What I received is all the content of test.csv, however not in its
original order:

kafka-console-consumer.sh --zookeeper 192.168.112.129:2181 --topic
kafka-topic2

2, def
1, abc
...
9, zzz
8, vwx


I read on Google that partitioning could be a feasible solution; however,
my questions are:

1. For small files like this one, do I really need to worry about
partitioning? How small does a partition have to be to guarantee the
sequence?
2. For big files, each partition could still contain multiple lines. How do I
ensure that the lines within each partition don't get reordered on the
consumer side?


I also want to know: what is the best practice for processing large volumes
of data through Kafka? There should be a better way than the console commands.

Thank you very much.



Sincerely yours,

Raymond


Re: Reliable way to purge data from Kafka topics

2018-05-25 Thread Shantanu Deshmukh
Hi Vincent,

We have an ELK cluster in both the primary and the backup DC, so the end goal
of the consumers (Logstash) is to index logs in Elasticsearch and show them in
Kibana. We replicate the data for the ELK clusters using mirror maker. It's
not possible to consume from both DCs at the same time, because the components
that produce the logs are active in only one of the DCs.

If mirror maker is known to generate duplicates, then what other reliable
means of replication is there? Someone suggested Confluent Replicator, but it
requires the Confluent Kafka distribution and we run Apache Kafka; we can't
change this infrastructure at our current stage.

Thanks & Regards,

Shantanu Deshmukh

On Fri 25 May, 2018, 1:30 PM Vincent Maurin, 
wrote:

> What is the end results done by your consumers ?
> From what I understand, having the need for no duplicates means that these
> duplicates can show up somewhere ?
>
> According your needs, you can also have consumers in the two DC consuming
> from both. Then you don't have duplicate because a message is either
> produced on one cluster or the other.
> I would really avoid mirror makers here for this setup (it is the component
> creating the duplicates if you consume from both clusters at the end)
>
>
> On Fri, May 25, 2018 at 9:29 AM Shantanu Deshmukh 
> wrote:
>
> > Hi Vincent,
> >
> > Our producers are consumers are indeed local to Kafka cluster. When we
> > switch DC everything switches. So when we are on backup producers and
> > consumers on backup DC are active, everything on primary DC is stopped.
> >
> > Whatever data gets accumulated on backup DC needs to be reflected in
> > primary DC. That's when we start reverse replication. And to clean up
> data
> > replicated from primary to backup (before switch happened), we have to
> > purge topics on backup Kafka cluster. And that is the challenge.
> >
> > On Fri, May 25, 2018 at 12:40 PM Vincent Maurin <
> vincent.mau...@glispa.com
> > >
> > wrote:
> >
> > > Hi Shantanu
> > >
> > > I am not sure the scenario you are describing is the best case. I would
> > > more consider the problem in term of producers and consumers of the
> data.
> > > Usually is a good practice to put your producer local to your kafka
> > > cluster, so in your case, I would suggest you have producers in the
> main
> > > and in the backup data center / region.
> > > Then the question arise for your consumers and eventually your data
> > storage
> > > behing. If it is centralized in one place, in could be better to no use
> > > mirror maker and have duplication of the consumer.
> > >
> > > So something looking more like a star schema, let me try some ascii
> art :
> > >
> > > Main DC :Data storage/processing DC :
> > > Producer --> Kafka   |Consumer >  Data storage
> > >  |   /->
> > > Backup DC :  |  /
> > > Producer --> Kafka   |Consumer /
> > >
> > > If you have an outage on the main, the backup can "deplace it" (maybe
> > just
> > > with a DNS switch or similar)
> > > If you have an outage on your storage/processing part, messages will
> just
> > > be stored in kafka the time your consumers are up again (plan enough
> disk
> > > on kafka to conver your SLA)
> > >
> > > Best,
> > >
> > >
> > >
> > >
> > > On Fri, May 25, 2018 at 9:00 AM Jörn Franke 
> > wrote:
> > >
> > > > Purging will never prevent that it does not get replicated for sure.
> > > There
> > > > will be always a case (error to purge etc) and then it is still
> > > replicated.
> > > > You may reduce the probability but it will never be impossible.
> > > >
> > > > Your application should be able to handle duplicated messages.
> > > >
> > > > > On 25. May 2018, at 08:54, Shantanu Deshmukh <
> shantanu...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > We have cross data center replication. Using Kafka mirror maker we
> > are
> > > > > replicating data from our primary cluster to backup cluster.
> Problem
> > > > arises
> > > > > when we start operating from backup cluster, in case of drill or
> > actual
> > > > > outage. Data gathered at backup cluster needs to be
> > reverse-replicated
> > > to
> > > > > primary. To do that I can only think of two options. 1) Use a
> > different
> > > > CG
> > > > > every time for mirror maker 2) Purge topics so that data sent by
> > > primary
> > > > > doesn't get replicated back to primary again due to reverse
> > > replication.
> > > > >
> > > > > We have opted for purging Kafka topics which are under
> replication. I
> > > use
> > > > > kafka-topics.sh --alter command to set retention of topic to 5
> > seconds
> > > to
> > > > > purge data. But this doesn't see to be a fool proof mechanism.
> Thread
> > > > > responsible for doing this every minute, and even if it runs it's
> not
> > > > sure
> > > > > to work as there are multiple conditions. That, segment should be
> > full
> > > or
> > > > > certain time should have passed to roll a new segment. It so
> happened
> > > > > during one such drill to move to backup cluster, purge command was
> > > > > issued and we waited for 5 minutes. Still data wasn't purged. Due to
> > > > > this we faced data duplication when reverse replication started.
> > > > >
> > > > > Is there a better way to achieve this?

Re: Unclear client-to-broker communication

2018-05-25 Thread chw
Could anyone please help?


On 21.05.2018 at 10:56, chw wrote:
> Hi everybody,
>
> the communication between the client and the broker is unclear to me.
> The documentation states:
>
>> The client initiates a socket connection and then writes a sequence of
>> request messages and reads back the corresponding response message. No
>> handshake is required on connection or disconnection.
> Does the client hold the TCP connection for its whole lifecycle? That
> is, does the client connect once to the broker and keep the connection for
> all subsequent requests/messages (as opposed to an HTTP request)?
>
> As far as I know, TCP requires a 3-way handshake to establish a connection.
> However, the documentation states that no handshake is required. Could
> anybody explain that point in more detail?
>
>> TCP is happier if you maintain persistent connections used for many
>> requests to amortize the cost of the TCP handshake, but beyond this
>> penalty connecting is pretty cheap.
> I do not understand what the purpose of this sentence is. On the one hand,
> TCP is explained a little. On the other hand, a justification concerning
> performance is made. But none of this information helps the user. Should I,
> as the user, ensure that a connection is maintained persistently, or does
> Kafka do that for me?
>
> It would be great if someone could update the documentation accordingly.
>
> Regards,
> Christian
>
>



Re: subscribe mail list

2018-05-25 Thread Matthias J. Sax
To subscribe, please follow instructions here:
https://kafka.apache.org/contact


On 5/24/18 8:16 PM,  wrote:
> subscribe mail list
> 





subscribe mail list

2018-05-25 Thread ????????????
subscribe mail list

Re: Reliable way to purge data from Kafka topics

2018-05-25 Thread Vincent Maurin
What is the end result produced by your consumers? From what I understand,
needing no duplicates means that duplicates could otherwise show up somewhere
downstream?

Depending on your needs, you could also have consumers in both DCs, each
consuming from its local cluster. Then you don't get duplicates, because each
message is produced on one cluster or the other, never both. I would really
avoid mirror maker for this setup (it is the component that creates the
duplicates if you end up consuming from both clusters).
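A rough sketch of that layout, with placeholder broker addresses and
topic/group names (a Logstash Kafka input would be configured the same way,
one pipeline per cluster):

# one consumer per cluster, no mirror maker in between
./kafka-console-consumer.sh --bootstrap-server kafka-main-dc:9092 \
  --topic app-logs --consumer-property group.id=es-indexer

./kafka-console-consumer.sh --bootstrap-server kafka-backup-dc:9092 \
  --topic app-logs --consumer-property group.id=es-indexer

Since any given message exists on only one of the two clusters, nothing is
read twice.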


On Fri, May 25, 2018 at 9:29 AM Shantanu Deshmukh 
wrote:

> Hi Vincent,
>
> Our producers are consumers are indeed local to Kafka cluster. When we
> switch DC everything switches. So when we are on backup producers and
> consumers on backup DC are active, everything on primary DC is stopped.
>
> Whatever data gets accumulated on backup DC needs to be reflected in
> primary DC. That's when we start reverse replication. And to clean up data
> replicated from primary to backup (before switch happened), we have to
> purge topics on backup Kafka cluster. And that is the challenge.
>
> On Fri, May 25, 2018 at 12:40 PM Vincent Maurin  >
> wrote:
>
> > Hi Shantanu
> >
> > I am not sure the scenario you are describing is the best case. I would
> > more consider the problem in term of producers and consumers of the data.
> > Usually is a good practice to put your producer local to your kafka
> > cluster, so in your case, I would suggest you have producers in the main
> > and in the backup data center / region.
> > Then the question arise for your consumers and eventually your data
> storage
> > behing. If it is centralized in one place, in could be better to no use
> > mirror maker and have duplication of the consumer.
> >
> > So something looking more like a star schema, let me try some ascii art :
> >
> > Main DC :Data storage/processing DC :
> > Producer --> Kafka   |Consumer >  Data storage
> >  |   /->
> > Backup DC :  |  /
> > Producer --> Kafka   |Consumer /
> >
> > If you have an outage on the main, the backup can "deplace it" (maybe
> just
> > with a DNS switch or similar)
> > If you have an outage on your storage/processing part, messages will just
> > be stored in kafka the time your consumers are up again (plan enough disk
> > on kafka to conver your SLA)
> >
> > Best,
> >
> >
> >
> >
> > On Fri, May 25, 2018 at 9:00 AM Jörn Franke 
> wrote:
> >
> > > Purging will never prevent that it does not get replicated for sure.
> > There
> > > will be always a case (error to purge etc) and then it is still
> > replicated.
> > > You may reduce the probability but it will never be impossible.
> > >
> > > Your application should be able to handle duplicated messages.
> > >
> > > > On 25. May 2018, at 08:54, Shantanu Deshmukh 
> > > wrote:
> > > >
> > > > Hello,
> > > >
> > > > We have cross data center replication. Using Kafka mirror maker we
> are
> > > > replicating data from our primary cluster to backup cluster. Problem
> > > arises
> > > > when we start operating from backup cluster, in case of drill or
> actual
> > > > outage. Data gathered at backup cluster needs to be
> reverse-replicated
> > to
> > > > primary. To do that I can only think of two options. 1) Use a
> different
> > > CG
> > > > every time for mirror maker 2) Purge topics so that data sent by
> > primary
> > > > doesn't get replicated back to primary again due to reverse
> > replication.
> > > >
> > > > We have opted for purging Kafka topics which are under replication. I
> > use
> > > > kafka-topics.sh --alter command to set retention of topic to 5
> seconds
> > to
> > > > purge data. But this doesn't see to be a fool proof mechanism. Thread
> > > > responsible for doing this every minute, and even if it runs it's not
> > > sure
> > > > to work as there are multiple conditions. That, segment should be
> full
> > or
> > > > certain time should have passed to roll a new segment. It so happened
> > > > during one such drill to move to backup cluster, purge command was
> > issued
> > > > and we waited for 5 minutes. Still data wasn't purged. Due to this we
> > > faced
> > > > data duplication when reverse replication started.
> > > >
> > > > Is there a better way to achieve this?
> > >
> >
>


Re: Reliable way to purge data from Kafka topics

2018-05-25 Thread Shantanu Deshmukh
Hi Vincent,

Our producers and consumers are indeed local to the Kafka cluster. When we
switch DCs, everything switches: when we are on the backup, the producers and
consumers in the backup DC are active and everything in the primary DC is
stopped.

Whatever data gets accumulated in the backup DC then needs to be reflected in
the primary DC; that's when we start the reverse replication. And to clean up
the data that was replicated from primary to backup (before the switch
happened), we have to purge the topics on the backup Kafka cluster. That is
the challenge.

On Fri, May 25, 2018 at 12:40 PM Vincent Maurin 
wrote:

> Hi Shantanu
>
> I am not sure the scenario you are describing is the best case. I would
> more consider the problem in term of producers and consumers of the data.
> Usually is a good practice to put your producer local to your kafka
> cluster, so in your case, I would suggest you have producers in the main
> and in the backup data center / region.
> Then the question arise for your consumers and eventually your data storage
> behing. If it is centralized in one place, in could be better to no use
> mirror maker and have duplication of the consumer.
>
> So something looking more like a star schema, let me try some ascii art :
>
> Main DC :Data storage/processing DC :
> Producer --> Kafka   |Consumer >  Data storage
>  |   /->
> Backup DC :  |  /
> Producer --> Kafka   |Consumer /
>
> If you have an outage on the main, the backup can "deplace it" (maybe just
> with a DNS switch or similar)
> If you have an outage on your storage/processing part, messages will just
> be stored in kafka the time your consumers are up again (plan enough disk
> on kafka to conver your SLA)
>
> Best,
>
>
>
>
> On Fri, May 25, 2018 at 9:00 AM Jörn Franke  wrote:
>
> > Purging will never prevent that it does not get replicated for sure.
> There
> > will be always a case (error to purge etc) and then it is still
> replicated.
> > You may reduce the probability but it will never be impossible.
> >
> > Your application should be able to handle duplicated messages.
> >
> > > On 25. May 2018, at 08:54, Shantanu Deshmukh 
> > wrote:
> > >
> > > Hello,
> > >
> > > We have cross data center replication. Using Kafka mirror maker we are
> > > replicating data from our primary cluster to backup cluster. Problem
> > arises
> > > when we start operating from backup cluster, in case of drill or actual
> > > outage. Data gathered at backup cluster needs to be reverse-replicated
> to
> > > primary. To do that I can only think of two options. 1) Use a different
> > CG
> > > every time for mirror maker 2) Purge topics so that data sent by
> primary
> > > doesn't get replicated back to primary again due to reverse
> replication.
> > >
> > > We have opted for purging Kafka topics which are under replication. I
> use
> > > kafka-topics.sh --alter command to set retention of topic to 5 seconds
> to
> > > purge data. But this doesn't see to be a fool proof mechanism. Thread
> > > responsible for doing this every minute, and even if it runs it's not
> > sure
> > > to work as there are multiple conditions. That, segment should be full
> or
> > > certain time should have passed to roll a new segment. It so happened
> > > during one such drill to move to backup cluster, purge command was
> issued
> > > and we waited for 5 minutes. Still data wasn't purged. Due to this we
> > faced
> > > data duplication when reverse replication started.
> > >
> > > Is there a better way to achieve this?
> >
>


Re: Reliable way to purge data from Kafka topics

2018-05-25 Thread Vincent Maurin
Hi Shantanu

I am not sure the scenario you are describing is the best approach. I would
rather look at the problem in terms of the producers and consumers of the
data. It is usually good practice to keep your producers local to your Kafka
cluster, so in your case I would suggest having producers in both the main
and the backup data center / region.
Then the question arises for your consumers and, eventually, the data storage
behind them. If the storage is centralized in one place, it could be better
not to use mirror maker and to duplicate the consumers instead.

So it would look more like a star schema; let me try some ASCII art:

Main DC :                    Data storage/processing DC :
Producer --> Kafka  ------>  Consumer ------->  Data storage
                                             /
Backup DC :                                 /
Producer --> Kafka  ------>  Consumer -----/

If you have an outage in the main DC, the backup can replace it (maybe just
with a DNS switch or similar).
If you have an outage in the storage/processing part, messages will simply be
stored in Kafka until your consumers are up again (plan for enough disk on
the Kafka side to cover your SLA).
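For a rough, purely illustrative sizing: assuming an ingest rate of 20 MB/s
(an assumption, not a measured number) and a 24-hour recovery SLA, that is
about 20 MB/s x 86,400 s ≈ 1.7 TB of log data to retain per cluster, before
replication overhead.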

Best,




On Fri, May 25, 2018 at 9:00 AM Jörn Franke  wrote:

> Purging will never prevent that it does not get replicated for sure. There
> will be always a case (error to purge etc) and then it is still replicated.
> You may reduce the probability but it will never be impossible.
>
> Your application should be able to handle duplicated messages.
>
> > On 25. May 2018, at 08:54, Shantanu Deshmukh 
> wrote:
> >
> > Hello,
> >
> > We have cross data center replication. Using Kafka mirror maker we are
> > replicating data from our primary cluster to backup cluster. Problem
> arises
> > when we start operating from backup cluster, in case of drill or actual
> > outage. Data gathered at backup cluster needs to be reverse-replicated to
> > primary. To do that I can only think of two options. 1) Use a different
> CG
> > every time for mirror maker 2) Purge topics so that data sent by primary
> > doesn't get replicated back to primary again due to reverse replication.
> >
> > We have opted for purging Kafka topics which are under replication. I use
> > kafka-topics.sh --alter command to set retention of topic to 5 seconds to
> > purge data. But this doesn't see to be a fool proof mechanism. Thread
> > responsible for doing this every minute, and even if it runs it's not
> sure
> > to work as there are multiple conditions. That, segment should be full or
> > certain time should have passed to roll a new segment. It so happened
> > during one such drill to move to backup cluster, purge command was issued
> > and we waited for 5 minutes. Still data wasn't purged. Due to this we
> faced
> > data duplication when reverse replication started.
> >
> > Is there a better way to achieve this?
>


Re: Reliable way to purge data from Kafka topics

2018-05-25 Thread Jörn Franke
Purging will never guarantee that the data does not get replicated back.
There will always be some case (an error while purging, etc.) where it is
still replicated. You can reduce the probability, but you can never rule it
out.

Your application should be able to handle duplicated messages.
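If you keep the retention trick quoted below, note that retention is only
applied to closed log segments, so the active segment is not deleted until it
rolls. A rough sketch that temporarily lowers segment.ms as well (topic name
and ZooKeeper address are placeholders):

# shrink retention and force segments to roll quickly
./kafka-configs.sh --zookeeper zk-backup:2181 --alter --entity-type topics \
  --entity-name my-topic --add-config retention.ms=5000,segment.ms=5000

# ...wait for the data to be deleted, then restore the original settings
./kafka-configs.sh --zookeeper zk-backup:2181 --alter --entity-type topics \
  --entity-name my-topic --delete-config retention.ms,segment.ms

Newer Kafka versions also ship kafka-delete-records.sh, which deletes records
up to a given offset per partition (via --bootstrap-server and
--offset-json-file) and does not depend on segment rolling at all.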

> On 25. May 2018, at 08:54, Shantanu Deshmukh  wrote:
> 
> Hello,
> 
> We have cross-data-center replication: using Kafka mirror maker we are
> replicating data from our primary cluster to a backup cluster. The problem
> arises when we start operating from the backup cluster, during a drill or an
> actual outage. Data gathered on the backup cluster then needs to be
> reverse-replicated to the primary. To do that I can only think of two
> options: 1) use a different consumer group for mirror maker every time, or
> 2) purge the topics so that data sent by the primary doesn't get replicated
> back to the primary again by the reverse replication.
> 
> We have opted for purging the Kafka topics which are under replication. I
> use the kafka-topics.sh --alter command to set the topic retention to 5
> seconds to purge the data. But this doesn't seem to be a foolproof
> mechanism. The thread responsible for this only runs every minute, and even
> when it runs it isn't guaranteed to delete anything, because several
> conditions must be met: a segment has to be full, or a certain amount of
> time has to have passed, before a new segment is rolled. During one such
> drill to move to the backup cluster, the purge command was issued and we
> waited for 5 minutes, but the data still wasn't purged. Because of this we
> faced data duplication when the reverse replication started.
> 
> Is there a better way to achieve this?