Re: Mirror Maker 2.0 Queries

2020-08-07 Thread Ananya Sen
Thank you Ryanne for the quick response.
I'd like to clarify a few further points.

Mirror Maker 2.0 is based on the Kafka Connect framework. In Kafka Connect
we have multiple workers and each worker has some assigned tasks. Mapping
this to Mirror Maker 2.0, a Mirror Maker driver will have some workers.

1) Can the number of workers be configured?
2) What is the default value of this worker configuration?
3) Is every topic partition given a new task?
4) Is every consumer group - topic pair given a new task for replicating
offsets?

Also, consider a case where I have 1000 topics in a Kafka cluster, each
topic holds a large amount of data, and new data is being written at high
throughput. Now I want to set up Mirror Maker 2.0 on this cluster to
replicate all the old data (still retained in the topics) as well as the
new incoming data to a backup cluster. How can I scale up the Mirror Maker
instance so that I have very little lag?
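For reference, my current understanding is that scaling is driven by the `tasks.max` Connect setting in the MM2 properties file, plus running additional driver nodes with the same config. A sketch of what I mean (cluster aliases, hostnames, and values here are invented examples, not our real setup):

```properties
# connect-mirror-maker.properties (example values only)
clusters = primary, backup
primary.bootstrap.servers = primary-kafka:9092
backup.bootstrap.servers = backup-kafka:9092

# replicate everything from primary to backup
primary->backup.enabled = true
primary->backup.topics = .*

# upper bound on Connect tasks per connector; raise this and start
# additional connect-mirror-maker.sh nodes with the same file to scale out
tasks.max = 32
```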

On 2020/07/11 06:37:56, Ananya Sen  wrote: 
> Hi
> 
> I was exploring the Mirror maker 2.0. I read through this
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
> documentation
> and I have  a few questions.
> 
>1. For running mirror maker as a dedicated mirror maker cluster, the
>documentation specifies a config file and a starter script. Is this mirror
>maker process distributed?
>2. I could not find any port configuration for the above mirror maker
>process. Can we configure mirror maker itself to run as a cluster, i.e.
>run process instances across multiple servers to avoid downtime due
>to a server crash?
>3. If we could somehow run the mirror maker as a distributed process,
>does that mean that topic and consumer offset replication will be
>shared among those mirror maker processes?
>4. What is the default port of this mirror maker process and how can we
>override it?
> 
> Looking forward to your reply.
> 
> 
> Thanks & Regards
> Ananya Sen
> 


Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread Manoj.Agrawal2
Or you can move data dirs manually (I'm assuming you have replicas > 1):
Stop the Kafka process on broker 1.
Move 1 or 2 log dirs from disk 1 to disk 2.
Start the Kafka process again.

Wait for the ISR to sync.

Then you can repeat these steps on the next broker.

On 8/7/20, 6:45 AM, "William Reynolds"  
wrote:

[External]


Hmm, that's odd, I am sure it was in the docs previously. Here is the
KIP on it 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
Basically the reassignment json that you get looks like this from the
initial generation and if you already have a reassignment file you can
just add the log dirs section to each partition entry

{
  "version" : int,
  "partitions" : [
{
  "topic" : str,
  "partition" : int,
  "replicas" : [int],
  "log_dirs" : [str]<-- NEW. A log directory can be either
"any", or a valid absolute path that begins with '/'. This is an
optional field. It is treated as an array of "any" if this field is
not explicitly specified in the json file.
},
...
  ]
}

Hope that helps
William

On 07/08/2020, Péter Nagykátai  wrote:
> Thank you William,
>
> I checked the doc and don't see any instructions regarding disks. Should I
> simply "move around" the topics and Kafka will assign the topics evenly on
> the two disks (per broker)? The current setup looks like this (for the
> topic in question, 15 primary, replica partitions):
>
> Broker 1 - disk 1: 8 partition
> Broker 1 - disk 2: 2 partition
>
> Broker 2 - disk 1: 8 partition
> Broker 2 - disk 2: 2 partition
>
> Broker 3 - disk 1: 8 partition
> Broker 3 - disk 2: 2 partition
>
> Thanks!
>
> On Fri, Aug 7, 2020 at 1:01 PM William Reynolds <
> william.reyno...@instaclustr.com> wrote:
>
>> Hi Péter,
>> Sounds like time to reassign the partitions you have across all the
>> brokers/data dirs using the instructions from here
>> https://kafka.apache.org/documentation/#basic_ops_automigrate. That
>> assumes that your partition strategy has somewhat evenly filled your
>> partitions and given it may move all the partitions it could be a bit
>> intensive so be sure to use the throttle option.
>> Cheers
>> William
>>
>> On 07/08/2020, Péter Nagykátai  wrote:
>> > Hello everybody,
>> >
>> > Thank you for the detailed answers. My issue is partly answered here:
>> >
>> >
>> >
>> >
>> > *This rule also applies to disk-level, which means that when a set
>> > ofpartitions assigned to a specific broker, each of the disks will get
>> > thesame number of partitions without considering the load of disks at
>> > thattime.*
>> >
>> >  I admit, I didn't provide enough info either.
>> >
>> > So my problem is that an existing topic got a huge surge of events for
>> this
>> > week. I knew that'll happen and I modified the partition count.
>> > Unfortunately, it occurred to me a bit later, that I'll likely need
>> > some
>> > extra disk space. So I added an extra disk to each broker. The thing I
>> > didn't know, that Kafka won't evenly distribute the partitions on the
>> > disks.
>> > So the question still remains:
>> >  Is there any way to have Kafka evenly distribute data on its disks?
>> > Also, what options do I have *after *I'm in the situation I described
>> > above? (preferably without deleting the topic)
>> >
>> > Thanks!
>> >
>> > On Fri, Aug 7, 2020 at 12:00 PM Yingshuan Song
>> > 
>> > wrote:
>> >
>> >> Hi Peter,
>> >> Agreed with Manoj and Vinicius, i think those rules led to this result
>> >> :
>> >>
>> >> 1)the partitions of a topic - N and replication number - R determine
>> >> the
>> >> real partition-replica count of this topic, which is N * R;
>> >> 2)   kafka can distribute partitions evenly among brokers, but it is
>> >> based
>> >> on the broker count when the topic was created, this is important.
>> >> If we create a topic (N - 4, R - 3) in a kafka cluster which contains
>> >> 3
>> >> kafka brokers, then 4 * 3 / 3 = 4 partitions will be assigned to each
>> >> broker.
>> >> But if a new broker was added int

anonymization of Avro messages

2020-08-07 Thread Dumitru-Nicolae Marasoui
Hello kafka community,
We have Avro topics (with Avro values).
We think of annonymizing the data to use it on UAT.
Do you know of any tools that can be used or bought?
Thanks,

-- 

Dumitru-Nicolae Marasoui

Software Engineer





Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread William Reynolds
Hmm, that's odd, I am sure it was in the docs previously. Here is the
KIP on it 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
Basically the reassignment json that you get looks like this from the
initial generation and if you already have a reassignment file you can
just add the log dirs section to each partition entry

{
  "version" : int,
  "partitions" : [
{
  "topic" : str,
  "partition" : int,
  "replicas" : [int],
  "log_dirs" : [str]<-- NEW. A log directory can be either
"any", or a valid absolute path that begins with '/'. This is an
optional field. It is treated as an array of "any" if this field is
not explicitly specified in the json file.
},
...
  ]
}
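Filled in, a minimal reassignment file of that shape might look like the following (the topic name, broker IDs, and paths are invented examples; note that "log_dirs", when given, needs one entry per replica, in the same order as "replicas"):

```json
{
  "version": 1,
  "partitions": [
    {
      "topic": "my-topic",
      "partition": 0,
      "replicas": [1, 2, 3],
      "log_dirs": ["/data/disk2/kafka-logs", "any", "any"]
    }
  ]
}
```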

Hope that helps
William

On 07/08/2020, Péter Nagykátai  wrote:
> Thank you William,
>
> I checked the doc and don't see any instructions regarding disks. Should I
> simply "move around" the topics and Kafka will assign the topics evenly on
> the two disks (per broker)? The current setup looks like this (for the
> topic in question, 15 primary, replica partitions):
>
> Broker 1 - disk 1: 8 partition
> Broker 1 - disk 2: 2 partition
>
> Broker 2 - disk 1: 8 partition
> Broker 2 - disk 2: 2 partition
>
> Broker 3 - disk 1: 8 partition
> Broker 3 - disk 2: 2 partition
>
> Thanks!
>
> On Fri, Aug 7, 2020 at 1:01 PM William Reynolds <
> william.reyno...@instaclustr.com> wrote:
>
>> Hi Péter,
>> Sounds like time to reassign the partitions you have across all the
>> brokers/data dirs using the instructions from here
>> https://kafka.apache.org/documentation/#basic_ops_automigrate. That
>> assumes that your partition strategy has somewhat evenly filled your
>> partitions and given it may move all the partitions it could be a bit
>> intensive so be sure to use the throttle option.
>> Cheers
>> William
>>
>> On 07/08/2020, Péter Nagykátai  wrote:
>> > Hello everybody,
>> >
>> > Thank you for the detailed answers. My issue is partly answered here:
>> >
>> >
>> >
>> >
>> > *This rule also applies to disk-level, which means that when a set
>> > ofpartitions assigned to a specific broker, each of the disks will get
>> > thesame number of partitions without considering the load of disks at
>> > thattime.*
>> >
>> >  I admit, I didn't provide enough info either.
>> >
>> > So my problem is that an existing topic got a huge surge of events for
>> this
>> > week. I knew that'll happen and I modified the partition count.
>> > Unfortunately, it occurred to me a bit later, that I'll likely need
>> > some
>> > extra disk space. So I added an extra disk to each broker. The thing I
>> > didn't know, that Kafka won't evenly distribute the partitions on the
>> > disks.
>> > So the question still remains:
>> >  Is there any way to have Kafka evenly distribute data on its disks?
>> > Also, what options do I have *after *I'm in the situation I described
>> > above? (preferably without deleting the topic)
>> >
>> > Thanks!
>> >
>> > On Fri, Aug 7, 2020 at 12:00 PM Yingshuan Song
>> > 
>> > wrote:
>> >
>> >> Hi Peter,
>> >> Agreed with Manoj and Vinicius, i think those rules led to this result
>> >> :
>> >>
>> >> 1)the partitions of a topic - N and replication number - R determine
>> >> the
>> >> real partition-replica count of this topic, which is N * R;
>> >> 2)   kafka can distribute partitions evenly among brokers, but it is
>> >> based
>> >> on the broker count when the topic was created, this is important.
>> >> If we create a topic (N - 4, R - 3) in a kafka cluster which contains
>> >> 3
>> >> kafka brokers, then 4 * 3 / 3 = 4 partitions will be assigned to each
>> >> broker.
>> >> But if a new broker was added into this cluster and another topic (N -
>> 4,
>> >> R
>> >> - 3) need to be created, then 4 * 3 / 4 = 3 partitions will be
>> >> assigned
>> >> to
>> >> each broker.
>> >> Kafka will not assign all those partitions to the new added broker
>> >> even
>> >> though it is idle and i think this is a shortcoming of kafka.
>> >> This rule also applies to disk-level, which means that when a set of
>> >> partitions assigned to a specific broker, each of the disks will get
>> >> the
>> >> same number of partitions without considering the load of disks at
>> >> that
>> >> time.
>> >> 3) when producer send records to topics, how to chose partiton : 3-1)
>> >> if
>> >> a
>> >> record has a key, then the partition number calculate according to the
>> >> key;
>> >> 3-2) if  records have no keys, then those records will be sent to each
>> >> partition in turns. So, if there are lots of records with the same
>> >> key,
>> >> and
>> >> those records will be sent to the same partition, and may take up a
>> >> lot
>> >> of
>> >> disk space.
>> >>
>> >>
>> >> hope this helps
>> >>
>> >> Vinicius Scheidegger  于2020年8月7日周五
>> >> 上午6:10写道:
>> >>
>> >> > Hi Peter,
>> >> >
>> >> > AFAIK, everything depends on:
>> >> >
>> >> > 1) How you have configured your topic
>> >> >   a) number of partiti

Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread Péter Nagykátai
Thank you William,

I checked the doc and don't see any instructions regarding disks. Should I
simply "move around" the topics and Kafka will assign the topics evenly on
the two disks (per broker)? The current setup looks like this (for the
topic in question, 15 primary, replica partitions):

Broker 1 - disk 1: 8 partition
Broker 1 - disk 2: 2 partition

Broker 2 - disk 1: 8 partition
Broker 2 - disk 2: 2 partition

Broker 3 - disk 1: 8 partition
Broker 3 - disk 2: 2 partition

Thanks!

On Fri, Aug 7, 2020 at 1:01 PM William Reynolds <
william.reyno...@instaclustr.com> wrote:

> Hi Péter,
> Sounds like time to reassign the partitions you have across all the
> brokers/data dirs using the instructions from here
> https://kafka.apache.org/documentation/#basic_ops_automigrate. That
> assumes that your partition strategy has somewhat evenly filled your
> partitions and given it may move all the partitions it could be a bit
> intensive so be sure to use the throttle option.
> Cheers
> William
>
> On 07/08/2020, Péter Nagykátai  wrote:
> > Hello everybody,
> >
> > Thank you for the detailed answers. My issue is partly answered here:
> >
> >
> >
> >
> > *This rule also applies to disk-level, which means that when a set
> > ofpartitions assigned to a specific broker, each of the disks will get
> > thesame number of partitions without considering the load of disks at
> > thattime.*
> >
> >  I admit, I didn't provide enough info either.
> >
> > So my problem is that an existing topic got a huge surge of events for
> this
> > week. I knew that'll happen and I modified the partition count.
> > Unfortunately, it occurred to me a bit later, that I'll likely need some
> > extra disk space. So I added an extra disk to each broker. The thing I
> > didn't know, that Kafka won't evenly distribute the partitions on the
> > disks.
> > So the question still remains:
> >  Is there any way to have Kafka evenly distribute data on its disks?
> > Also, what options do I have *after *I'm in the situation I described
> > above? (preferably without deleting the topic)
> >
> > Thanks!
> >
> > On Fri, Aug 7, 2020 at 12:00 PM Yingshuan Song 
> > wrote:
> >
> >> Hi Peter,
> >> Agreed with Manoj and Vinicius, i think those rules led to this result :
> >>
> >> 1)the partitions of a topic - N and replication number - R determine the
> >> real partition-replica count of this topic, which is N * R;
> >> 2)   kafka can distribute partitions evenly among brokers, but it is
> >> based
> >> on the broker count when the topic was created, this is important.
> >> If we create a topic (N - 4, R - 3) in a kafka cluster which contains 3
> >> kafka brokers, then 4 * 3 / 3 = 4 partitions will be assigned to each
> >> broker.
> >> But if a new broker was added into this cluster and another topic (N -
> 4,
> >> R
> >> - 3) need to be created, then 4 * 3 / 4 = 3 partitions will be assigned
> >> to
> >> each broker.
> >> Kafka will not assign all those partitions to the new added broker even
> >> though it is idle and i think this is a shortcoming of kafka.
> >> This rule also applies to disk-level, which means that when a set of
> >> partitions assigned to a specific broker, each of the disks will get the
> >> same number of partitions without considering the load of disks at that
> >> time.
> >> 3) when producer send records to topics, how to chose partiton : 3-1) if
> >> a
> >> record has a key, then the partition number calculate according to the
> >> key;
> >> 3-2) if  records have no keys, then those records will be sent to each
> >> partition in turns. So, if there are lots of records with the same key,
> >> and
> >> those records will be sent to the same partition, and may take up a lot
> >> of
> >> disk space.
> >>
> >>
> >> hope this helps
> >>
> >> Vinicius Scheidegger  于2020年8月7日周五
> >> 上午6:10写道:
> >>
> >> > Hi Peter,
> >> >
> >> > AFAIK, everything depends on:
> >> >
> >> > 1) How you have configured your topic
> >> >   a) number of partitions (here I understand you have 15 partitions)
> >> >   b) partition replication configuration (each partition necessarily
> >> > has
> >> a
> >> > leader - primary responsible to hold the data - and for reads and
> >> > writes)
> >> > you can configure the topic to have a number of replicas
> >> > 2) How you publish messages to the topic
> >> >   a) The publisher is responsible to choose the partition. This can be
> >> done
> >> > consciously (by setting the partition id while sending the message to
> >> > the
> >> > topic) or unconsciously (by using the DefaultPartitioner or any other
> >> > partitioner scheme).
> >> >
> >> > All messages sent to a specific partition will be written first to the
> >> > leader (meaning that the disk configured for the partition leader will
> >> > receive the load) and then replicated to the replica (followers).
> >> > Kafka does not automatically distribute the data equally to the
> >> > different
> >> > brokers - you need to think about your architecture having that in
> >> > 

Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread William Reynolds
Hi Péter,
Sounds like time to reassign the partitions you have across all the
brokers/data dirs using the instructions from here
https://kafka.apache.org/documentation/#basic_ops_automigrate. That
assumes that your partition strategy has somewhat evenly filled your
partitions and given it may move all the partitions it could be a bit
intensive so be sure to use the throttle option.
Cheers
William

On 07/08/2020, Péter Nagykátai  wrote:
> Hello everybody,
>
> Thank you for the detailed answers. My issue is partly answered here:
>
>
>
>
> *This rule also applies to disk-level, which means that when a set
> ofpartitions assigned to a specific broker, each of the disks will get
> thesame number of partitions without considering the load of disks at
> thattime.*
>
>  I admit, I didn't provide enough info either.
>
> So my problem is that an existing topic got a huge surge of events for this
> week. I knew that'll happen and I modified the partition count.
> Unfortunately, it occurred to me a bit later, that I'll likely need some
> extra disk space. So I added an extra disk to each broker. The thing I
> didn't know, that Kafka won't evenly distribute the partitions on the
> disks.
> So the question still remains:
>  Is there any way to have Kafka evenly distribute data on its disks?
> Also, what options do I have *after *I'm in the situation I described
> above? (preferably without deleting the topic)
>
> Thanks!
>
> On Fri, Aug 7, 2020 at 12:00 PM Yingshuan Song 
> wrote:
>
>> Hi Peter,
>> Agreed with Manoj and Vinicius, i think those rules led to this result :
>>
>> 1)the partitions of a topic - N and replication number - R determine the
>> real partition-replica count of this topic, which is N * R;
>> 2)   kafka can distribute partitions evenly among brokers, but it is
>> based
>> on the broker count when the topic was created, this is important.
>> If we create a topic (N - 4, R - 3) in a kafka cluster which contains 3
>> kafka brokers, then 4 * 3 / 3 = 4 partitions will be assigned to each
>> broker.
>> But if a new broker was added into this cluster and another topic (N - 4,
>> R
>> - 3) need to be created, then 4 * 3 / 4 = 3 partitions will be assigned
>> to
>> each broker.
>> Kafka will not assign all those partitions to the new added broker even
>> though it is idle and i think this is a shortcoming of kafka.
>> This rule also applies to disk-level, which means that when a set of
>> partitions assigned to a specific broker, each of the disks will get the
>> same number of partitions without considering the load of disks at that
>> time.
>> 3) when producer send records to topics, how to chose partiton : 3-1) if
>> a
>> record has a key, then the partition number calculate according to the
>> key;
>> 3-2) if  records have no keys, then those records will be sent to each
>> partition in turns. So, if there are lots of records with the same key,
>> and
>> those records will be sent to the same partition, and may take up a lot
>> of
>> disk space.
>>
>>
>> hope this helps
>>
>> Vinicius Scheidegger  于2020年8月7日周五
>> 上午6:10写道:
>>
>> > Hi Peter,
>> >
>> > AFAIK, everything depends on:
>> >
>> > 1) How you have configured your topic
>> >   a) number of partitions (here I understand you have 15 partitions)
>> >   b) partition replication configuration (each partition necessarily
>> > has
>> a
>> > leader - primary responsible to hold the data - and for reads and
>> > writes)
>> > you can configure the topic to have a number of replicas
>> > 2) How you publish messages to the topic
>> >   a) The publisher is responsible to choose the partition. This can be
>> done
>> > consciously (by setting the partition id while sending the message to
>> > the
>> > topic) or unconsciously (by using the DefaultPartitioner or any other
>> > partitioner scheme).
>> >
>> > All messages sent to a specific partition will be written first to the
>> > leader (meaning that the disk configured for the partition leader will
>> > receive the load) and then replicated to the replica (followers).
>> > Kafka does not automatically distribute the data equally to the
>> > different
>> > brokers - you need to think about your architecture having that in
>> > mind.
>> >
>> > I hope it helps
>> >
>> > On Thu, Aug 6, 2020 at 10:23 PM Péter Nagykátai 
>> > wrote:
>> >
>> > > I initially started with one data disk (mounted solely to hold Kafka
>> > data)
>> > > and recently added a new one.
>> > >
>> > > On Thu, Aug 6, 2020 at 10:13 PM  wrote:
>> > >
>> > > > What do you mean older disk ?
>> > > >
>> > > > On 8/6/20, 12:05 PM, "Péter Nagykátai" 
>> wrote:
>> > > >
>> > > > [External]
>> > > >
>> > > >
>> > > > Yeah, but it doesn't do that. My "older" disks have ~70
>> partitions,
>> > > the
>> > > > newer ones ~5 partitions. That's why I'm asking what went
>> > > > wrong.
>> > > >
>> > > > On Thu, Aug 6, 2020 at 8:35 PM 
>> > wrote:
>> > > >
>> > > > > Kafka  evenly distributed number of partition on each disk so
>> in
>> > > > 

Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread Péter Nagykátai
Hello everybody,

Thank you for the detailed answers. My issue is partly answered here:




*This rule also applies to disk-level, which means that when a set of
partitions is assigned to a specific broker, each of the disks will get
the same number of partitions without considering the load of disks at
that time.*

 I admit, I didn't provide enough info either.

So my problem is that an existing topic got a huge surge of events this
week. I knew that would happen, so I modified the partition count.
Unfortunately, it occurred to me a bit later that I'd likely need some
extra disk space, so I added an extra disk to each broker. What I didn't
know was that Kafka won't evenly distribute the partitions on the disks.
So the question still remains:
 Is there any way to have Kafka evenly distribute data on its disks?
Also, what options do I have *after* I'm in the situation I described
above? (preferably without deleting the topic)

Thanks!

On Fri, Aug 7, 2020 at 12:00 PM Yingshuan Song 
wrote:

> Hi Peter,
> Agreed with Manoj and Vinicius, i think those rules led to this result :
>
> 1)the partitions of a topic - N and replication number - R determine the
> real partition-replica count of this topic, which is N * R;
> 2)   kafka can distribute partitions evenly among brokers, but it is based
> on the broker count when the topic was created, this is important.
> If we create a topic (N - 4, R - 3) in a kafka cluster which contains 3
> kafka brokers, then 4 * 3 / 3 = 4 partitions will be assigned to each
> broker.
> But if a new broker was added into this cluster and another topic (N - 4, R
> - 3) need to be created, then 4 * 3 / 4 = 3 partitions will be assigned to
> each broker.
> Kafka will not assign all those partitions to the new added broker even
> though it is idle and i think this is a shortcoming of kafka.
> This rule also applies to disk-level, which means that when a set of
> partitions assigned to a specific broker, each of the disks will get the
> same number of partitions without considering the load of disks at that
> time.
> 3) when producer send records to topics, how to chose partiton : 3-1) if a
> record has a key, then the partition number calculate according to the key;
> 3-2) if  records have no keys, then those records will be sent to each
> partition in turns. So, if there are lots of records with the same key, and
> those records will be sent to the same partition, and may take up a lot of
> disk space.
>
>
> hope this helps
>
> Vinicius Scheidegger  于2020年8月7日周五
> 上午6:10写道:
>
> > Hi Peter,
> >
> > AFAIK, everything depends on:
> >
> > 1) How you have configured your topic
> >   a) number of partitions (here I understand you have 15 partitions)
> >   b) partition replication configuration (each partition necessarily has
> a
> > leader - primary responsible to hold the data - and for reads and writes)
> > you can configure the topic to have a number of replicas
> > 2) How you publish messages to the topic
> >   a) The publisher is responsible to choose the partition. This can be
> done
> > consciously (by setting the partition id while sending the message to the
> > topic) or unconsciously (by using the DefaultPartitioner or any other
> > partitioner scheme).
> >
> > All messages sent to a specific partition will be written first to the
> > leader (meaning that the disk configured for the partition leader will
> > receive the load) and then replicated to the replica (followers).
> > Kafka does not automatically distribute the data equally to the different
> > brokers - you need to think about your architecture having that in mind.
> >
> > I hope it helps
> >
> > On Thu, Aug 6, 2020 at 10:23 PM Péter Nagykátai 
> > wrote:
> >
> > > I initially started with one data disk (mounted solely to hold Kafka
> > data)
> > > and recently added a new one.
> > >
> > > On Thu, Aug 6, 2020 at 10:13 PM  wrote:
> > >
> > > > What do you mean older disk ?
> > > >
> > > > On 8/6/20, 12:05 PM, "Péter Nagykátai" 
> wrote:
> > > >
> > > > [External]
> > > >
> > > >
> > > > Yeah, but it doesn't do that. My "older" disks have ~70
> partitions,
> > > the
> > > > newer ones ~5 partitions. That's why I'm asking what went wrong.
> > > >
> > > > On Thu, Aug 6, 2020 at 8:35 PM 
> > wrote:
> > > >
> > > > > Kafka  evenly distributed number of partition on each disk so
> in
> > > > your case
> > > > > every disk should have 3/2 topic partitions .
> > > > > It is producer job to evenly produce data by partition key  to
> > > topic
> > > > > partition .
> > > > > How it partition key , it is auto generated or producer sending
> > key
> > > > along
> > > > > with message .
> > > > >
> > > > >
> > > > > On 8/6/20, 7:29 AM, "Péter Nagykátai" 
> > > wrote:
> > > > >
> > > > > [External]
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > I have a Kafka cluster with 3 brokers (v2.3.0) and each
> > broker
> > > > has 2
> > > > >

Re: Kafka topic partition distributing evenly on disks

2020-08-07 Thread Yingshuan Song
Hi Peter,
Agreed with Manoj and Vinicius; I think these rules led to this result:

1) The partition count of a topic (N) and its replication factor (R)
determine the real partition-replica count of the topic, which is N * R.
2) Kafka distributes partitions evenly among brokers, but the assignment is
based on the broker count at the time the topic was created; this is
important.
If we create a topic (N = 4, R = 3) in a Kafka cluster that contains 3
brokers, then 4 * 3 / 3 = 4 partition replicas will be assigned to each
broker.
But if a new broker is added to the cluster and another topic (N = 4,
R = 3) is created, then 4 * 3 / 4 = 3 partition replicas will be assigned
to each broker.
Kafka will not assign all of those partitions to the newly added broker
even though it is idle, and I think this is a shortcoming of Kafka.
This rule also applies at the disk level, which means that when a set of
partitions is assigned to a specific broker, each of the disks will get the
same number of partitions without considering the load on the disks at that
time.
3) When a producer sends records to a topic, the partition is chosen as
follows: 3-1) if a record has a key, the partition number is calculated
from the key; 3-2) if records have no keys, they are sent to each partition
in turn. So if there are lots of records with the same key, those records
will all be sent to the same partition and may take up a lot of disk space.
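The selection in 3) can be sketched roughly like this (a toy illustration only: Kafka's DefaultPartitioner actually hashes keys with murmur2, not crc32, and the function and variable names here are invented):

```python
import itertools
import zlib

def choose_partition(key, num_partitions, round_robin):
    """Toy sketch of producer partition selection.

    Illustration only: Kafka's DefaultPartitioner uses murmur2, not crc32.
    """
    if key is not None:
        # 3-1) keyed record: the same key always maps to the same partition
        return (zlib.crc32(key) & 0x7FFFFFFF) % num_partitions
    # 3-2) unkeyed record: spread records across partitions in turn
    return next(round_robin) % num_partitions

rr = itertools.count()
# same key -> same partition, so a hot key concentrates load on one
# partition (and therefore on whichever disk holds its leader)
assert choose_partition(b"device-42", 15, rr) == choose_partition(b"device-42", 15, rr)
print([choose_partition(None, 3, rr) for _ in range(4)])  # -> [0, 1, 2, 0]
```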


hope this helps

Vinicius Scheidegger  于2020年8月7日周五 上午6:10写道:

> Hi Peter,
>
> AFAIK, everything depends on:
>
> 1) How you have configured your topic
>   a) number of partitions (here I understand you have 15 partitions)
>   b) partition replication configuration (each partition necessarily has a
> leader - primary responsible to hold the data - and for reads and writes)
> you can configure the topic to have a number of replicas
> 2) How you publish messages to the topic
>   a) The publisher is responsible to choose the partition. This can be done
> consciously (by setting the partition id while sending the message to the
> topic) or unconsciously (by using the DefaultPartitioner or any other
> partitioner scheme).
>
> All messages sent to a specific partition will be written first to the
> leader (meaning that the disk configured for the partition leader will
> receive the load) and then replicated to the replica (followers).
> Kafka does not automatically distribute the data equally to the different
> brokers - you need to think about your architecture having that in mind.
>
> I hope it helps
>
> On Thu, Aug 6, 2020 at 10:23 PM Péter Nagykátai 
> wrote:
>
> > I initially started with one data disk (mounted solely to hold Kafka
> data)
> > and recently added a new one.
> >
> > On Thu, Aug 6, 2020 at 10:13 PM  wrote:
> >
> > > What do you mean older disk ?
> > >
> > > On 8/6/20, 12:05 PM, "Péter Nagykátai"  wrote:
> > >
> > > [External]
> > >
> > >
> > > Yeah, but it doesn't do that. My "older" disks have ~70 partitions,
> > the
> > > newer ones ~5 partitions. That's why I'm asking what went wrong.
> > >
> > > On Thu, Aug 6, 2020 at 8:35 PM 
> wrote:
> > >
> > > > Kafka  evenly distributed number of partition on each disk so in
> > > your case
> > > > every disk should have 3/2 topic partitions .
> > > > It is producer job to evenly produce data by partition key  to
> > topic
> > > > partition .
> > > > How it partition key , it is auto generated or producer sending
> key
> > > along
> > > > with message .
> > > >
> > > >
> > > > On 8/6/20, 7:29 AM, "Péter Nagykátai" 
> > wrote:
> > > >
> > > > [External]
> > > >
> > > >
> > > > Hello,
> > > >
> > > > I have a Kafka cluster with 3 brokers (v2.3.0) and each
> broker
> > > has 2
> > > > disks
> > > > attached. I added a new topic (heavyweight) and was surprised
> > > that
> > > > even if
> > > > the topic has 15 partitions, those weren't distributed evenly
> > on
> > > the
> > > > disks.
> > > > Thus I got one disk that's almost empty and the other almost
> > > filled
> > > > up. Is
> > > > there any way to have Kafka evenly distribute data on its
> > disks?
> > > >
> > > > Thank you!
> > > >
> > > >

Kafka-client 2.5.0 connection to Azure Event Hub authentication failure

2020-08-07 Thread Schwilk David (IOC/PAP-TH)
Hello,

When trying to connect our Kafka client to an Azure Event Hub via SASL_SSL we 
encounter an error in the authentication process.

IllegalSaslStateException: Invalid SASL mechanism response, server may be 
expecting a different protocol at 2020-08-07T05:03:16.072637487Z,
trace: org.apache.kafka.common.errors.IllegalSaslStateException: Invalid SASL 
mechanism response, server may be expecting a different protocol Caused by:
org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
auth_bytes: Bytes size -1 cannot be negative at
org.apache.kafka.common.protocol.types.Schema.read(Schema.java:110) at
org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:313) at
org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:725)
 at
org.apache.kafka.clients.NetworkClient.parseResponse(NetworkClient.java:712) at
org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.receiveKafkaResponse(SaslClientAuthenticator.java:523)
 at
org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.receiveToken(SaslClientAuthenticator.java:457)
 at
org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.authenticate(SaslClientAuthenticator.java:266)
 at
org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:177) at
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:547) at
org.apache.kafka.common.network.Selector.poll(Selector.java:485) at
org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:549) at
org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324) at
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239) at
java.base/java.lang.Thread.run(Thread.java:836)

It seems like auth_bytes is incorrect in the token response from Event Hub.
Previously, when using client 2.1.1, the connections were working.
The SASL configuration we are connecting with seems correct to me and was
working with the 2.1.1 client as well:

bootstrap.servers=XXX.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule 
required username="$ConnectionString" 
password="Endpoint=sb://.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=**";

Were there any changes since 2.1.1 that could cause something like this? Is
this error known to you?

Best regards
David Schwilk

Bosch IoT Things- Product Area IoT Platform (IOC/PAP-TH)
Bosch.IO GmbH | Ziegelei 7 | 88090 Immenstaad | GERMANY | www.bosch.io
david.schw...@bosch-si.com

Sitz: Berlin, Registergericht: Amtsgericht Charlottenburg; HRB 148411 B
Aufsichtsratsvorsitzender: Dr.-Ing. Thorsten Lücke; Geschäftsführung: Dr. 
Stefan Ferber, Dr. Aleksandar Mitrovic, Yvonne Reckling
