[jira] [Commented] (KAFKA-3261) Consolidate class kafka.cluster.BrokerEndPoint and kafka.cluster.EndPoint

2016-02-28 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171006#comment-15171006
 ] 

Ismael Juma commented on KAFKA-3261:


[~chenzhu], you can still consolidate the regexes, right?

> Consolidate class kafka.cluster.BrokerEndPoint and kafka.cluster.EndPoint
> -
>
> Key: KAFKA-3261
> URL: https://issues.apache.org/jira/browse/KAFKA-3261
> Project: Kafka
>  Issue Type: Bug
>Reporter: Guozhang Wang
>Assignee: chen zhu
>
> These two classes serve similar purposes and can be consolidated. Also, as 
> [~sasakitoa] suggested, we can remove their "uriParseExp" variables and use 
> (a possibly modified)
> {code}
> private static final Pattern HOST_PORT_PATTERN = 
> Pattern.compile(".*?\\[?([0-9a-zA-Z\\-.:]*)\\]?:([0-9]+)");
> {code}
> in org.apache.kafka.common.utils.Utils instead.
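For illustration, a minimal sketch of how the consolidated pattern quoted above parses both a plain host:port string and a bracketed IPv6 address (the class and method names here are made up for this example; only the regex comes from the issue):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostPortDemo {
    // The shared pattern from the issue; group 1 is the host, group 2 the port.
    private static final Pattern HOST_PORT_PATTERN =
        Pattern.compile(".*?\\[?([0-9a-zA-Z\\-.:]*)\\]?:([0-9]+)");

    // Returns {host, port} or null if the address does not match.
    static String[] parse(String address) {
        Matcher m = HOST_PORT_PATTERN.matcher(address);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] a = parse("broker1:9092");
        System.out.println(a[0] + " " + a[1]);   // broker1 9092
        String[] b = parse("[2001:db8::1]:9093");
        System.out.println(b[0] + " " + b[1]);   // 2001:db8::1 9093
    }
}
```

The optional `\[?`/`\]?` around the host group is what lets one pattern cover both plain hostnames and bracketed IPv6 literals.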



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-28 Thread Becket Qin
Hi Guozhang,

The size of the memory mapped index file was our concern as well; that is
why we are suggesting minute-level time indexing instead of second-level.
A few thoughts on the extra memory cost of the time index:

1. Currently all the index files are loaded as memory mapped files. Note
that only the index of the active segment has the default size of 10MB;
the indexes of the old segments are typically much smaller. So if
we use the same initial size for time index files, the total amount of
memory won't be doubled, but the memory cost of the active segments will be.
(However, the 10MB value itself seems problematic; see the reasoning below.)

2. It is likely that the time index is much smaller than the offset index
because users would adjust the time index interval ms depending on the topic
volume, e.g. for a low volume topic the time index interval ms would be much
longer, so that in the extreme case we can avoid inserting one time index
entry for each message.

3. To further guard against unnecessarily frequent insertion of time
index entries, we use index.interval.bytes as a restriction on time
index entries as well, so that even for a newly created topic with the
default time.index.interval.ms we don't need to worry about overly
aggressive time index entry insertion.
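As a rough sketch (not the actual Kafka code; the class, field, and method names are invented for illustration), the two guards from points (2) and (3) combine like this: an entry is only inserted when enough time *and* enough bytes have passed since the last entry.

```java
public class TimeIndexGuard {
    // Illustrative stand-ins for the two configs discussed above; the real
    // broker wires these through its config system.
    static final long TIME_INDEX_INTERVAL_MS = 60_000L; // time.index.interval.ms
    static final long INDEX_INTERVAL_BYTES = 4096L;     // index.interval.bytes

    private long lastEntryTimestamp = -1L;
    private long bytesSinceLastEntry = 0L;

    /** Returns true if a time index entry should be inserted for this append. */
    boolean maybeInsert(long messageTimestamp, int messageSizeBytes) {
        bytesSinceLastEntry += messageSizeBytes;
        boolean timeElapsed = lastEntryTimestamp < 0
            || messageTimestamp - lastEntryTimestamp >= TIME_INDEX_INTERVAL_MS;
        boolean bytesElapsed = bytesSinceLastEntry >= INDEX_INTERVAL_BYTES;
        if (timeElapsed && bytesElapsed) {  // both guards must pass
            lastEntryTimestamp = messageTimestamp;
            bytesSinceLastEntry = 0L;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TimeIndexGuard guard = new TimeIndexGuard();
        System.out.println(guard.maybeInsert(1_000L, 100));    // false: too few bytes yet
        System.out.println(guard.maybeInsert(2_000L, 4_000));  // true: both guards pass
        System.out.println(guard.maybeInsert(30_000L, 5_000)); // false: < 60s since last entry
    }
}
```

This is why a low-volume topic cannot blow up the time index: even with a tiny time.index.interval.ms, the bytes guard caps the entry rate at the same density as the offset index.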

Considering the above, the overall memory cost for the time index should be
much smaller than that of the offset index. However, as you pointed out,
(1) might still be an issue. I am actually not sure why we always
allocate 10 MB for the index file. This itself looks like a problem, given
that we actually have a pretty good way to know the upper bound of the memory
taken by an offset index.

Theoretically, the offset index file will have at most (log.segment.bytes /
index.interval.bytes) entries. In our default configuration,
log.segment.bytes=1GB and index.interval.bytes=4K. This means we only need
(1GB/4K)*8 bytes = 2MB. Allocating 10 MB is really a big waste of memory.

I suggest we do the following:
1. When creating the log index file, we always allocate memory using the
above calculation.
2. If the memory calculated in (1) is greater than segment.index.bytes, we
use segment.index.bytes instead; otherwise we simply use the result of (1).

If we do this, I believe the memory used for index files will probably be
smaller even with the time index added. I will create a separate ticket for
the initial size of index files.
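The proposed sizing rule could be sketched as follows (illustrative names, not the actual patch; the constant 8 is the per-entry size implied by the (1GB/4K)*8 calculation above):

```java
public class IndexSizing {
    static final int OFFSET_INDEX_ENTRY_BYTES = 8; // per-entry size used in the thread's math

    /** Initial allocation: the theoretical upper bound, capped at the configured max. */
    static long initialIndexSize(long logSegmentBytes, long indexIntervalBytes,
                                 long segmentIndexBytes) {
        long upperBound = logSegmentBytes / indexIntervalBytes * OFFSET_INDEX_ENTRY_BYTES;
        return Math.min(upperBound, segmentIndexBytes);
    }

    public static void main(String[] args) {
        // Defaults from the thread: 1GB segments, 4K index interval, 10MB max index size.
        long size = initialIndexSize(1L << 30, 4096L, 10L * 1024 * 1024);
        System.out.println(size);  // 2097152, i.e. 2MB instead of 10MB
    }
}
```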

Thanks,

Jiangjie (Becket) Qin

On Thu, Feb 25, 2016 at 3:30 PM, Guozhang Wang  wrote:

> Jiangjie,
>
> I was originally only thinking about the "time.index.size.max.bytes" config
> in addition to the "offset.index.size.max.bytes". Since the latter's
> default size is 10MB, and for memory mapped file, we will allocate that
> much of memory at the start which could be a pressure on RAM if we double
> it.
>
> Guozhang
>
> On Wed, Feb 24, 2016 at 4:56 PM, Becket Qin  wrote:
>
> > Hi Guozhang,
> >
> > I thought about this again, and it seems we still need the
> > time.index.interval.ms configuration to avoid unnecessarily frequent time
> > index insertion.
> >
> > I just updated the wiki to add index.interval.bytes as an additional
> > constraint for time index entry insertion. Another slight change made was
> > that as long as a message timestamp shows time.index.interval.ms has passed
> > since the timestamp of the last time index entry, we will insert another
> > timestamp index entry. Previously we always inserted time index entries at
> > time.index.interval.ms bucket boundaries.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Feb 24, 2016 at 2:40 PM, Becket Qin 
> wrote:
> >
> > > Thanks for the comment Guozhang,
> > >
> > > I just changed the configuration name to "time.index.interval.ms".
> > >
> > > It seems the real question here is how big the time indices will be.
> > > Theoretically we can have one time index entry for each message in a log
> > > segment. For example, if there is one event per minute appended, we might
> > > have to have a time index entry for each message until the segment size is
> > > reached. In that case, the number of index entries in the time index would
> > > be (segment size / avg message size). So the time index file size can
> > > potentially be big.
> > >
> > > I am wondering if we can simply reuse the "index.interval.bytes"
> > > configuration instead of having a separate time index interval ms, i.e.
> > > instead of inserting a new entry based on time interval, we still insert
> > > it based on bytes interval. This does not affect the granularity because
> > > we can search from the nearest index entry to find the message with the
> > > correct timestamp. The good thing is that this guarantees there will not
> > > be huge time indices. We also save the new configuration.
> > >
> > > What do you think?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Wed, Feb 24, 2016 at 1:00 PM, Guozhang Wang 
> > wrote:
> > >
> > >> Thanks

[jira] [Created] (KAFKA-3300) Calculate the initial size allocation of offset index files and reduce the memory footprint for memory mapped files.

2016-02-28 Thread Jiangjie Qin (JIRA)
Jiangjie Qin created KAFKA-3300:
---

 Summary: Calculate the initial size allocation of offset index 
files and reduce the memory footprint for memory mapped files.
 Key: KAFKA-3300
 URL: https://issues.apache.org/jira/browse/KAFKA-3300
 Project: Kafka
  Issue Type: Improvement
Reporter: Jiangjie Qin
Assignee: Jiangjie Qin


Currently the initial/max size of an offset index file is configured by 
{{log.index.max.bytes}}. This will be the offset index file size for the active 
log segment until the segment rolls.

Theoretically, we can calculate the upper bound of offset index size using the 
following formula:
{noformat}
log.segment.bytes / index.interval.bytes * 8
{noformat}

With the default settings, the bytes needed for an offset index are 1GB / 4K * 
8 = 2MB, while the default log.index.max.bytes is 10MB.

This means we are over-allocating at least 8MB on disk and mapping it to memory.

We can probably do the following:
1. When creating a new offset index, calculate the size using the above formula,
2. If the result in (1) is greater than log.index.max.bytes, we allocate 
log.index.max.bytes instead.

This should be able to significantly save memory if a broker has a lot of 
partitions on it.
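A quick back-of-envelope check of the numbers above (the partition count is an assumption for illustration, not from the issue):

```java
public class OverAllocation {
    public static void main(String[] args) {
        long allocated = 10L * 1024 * 1024;   // default log.index.max.bytes (10MB)
        long needed = (1L << 30) / 4096 * 8;  // segment bytes / interval bytes * entry size
        long wastedPerPartition = allocated - needed;
        System.out.println(wastedPerPartition);  // 8388608 bytes, i.e. 8MB per active index
        // On a broker hosting, say, 1000 partitions, that is roughly 8GB of
        // unnecessary disk allocation and memory mapping.
    }
}
```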





[jira] [Updated] (KAFKA-3300) Calculate the initial size allocation of offset index files and reduce the memory footprint for memory mapped files.

2016-02-28 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiangjie Qin updated KAFKA-3300:

Affects Version/s: 0.9.0.1

> Calculate the initial size allocation of offset index files and reduce the 
> memory footprint for memory mapped files.
> 
>
> Key: KAFKA-3300
> URL: https://issues.apache.org/jira/browse/KAFKA-3300
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.9.0.1
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> Currently the initial/max size of offset index file is configured by 
> {{log.index.max.bytes}}. This will be the offset index file size for active 
> log segment until it rolls out. 
> Theoretically, we can calculate the upper bound of offset index size using 
> the following formula:
> {noformat}
> log.segment.bytes / index.interval.bytes * 8
> {noformat}
> With default setting the bytes needed for an offset index size is 1GB / 4K * 
> 8 = 2MB. And the default log.index.max.bytes is 10MB.
> This means we are over-allocating at least 8MB on disk and mapping it to 
> memory.
> We can probably do the following:
> 1. When creating a new offset index, calculate the size using the above 
> formula,
> 2. If the result in (1) is greater than log.index.max.bytes, we allocate 
> log.index.max.bytes instead.
> This should be able to significantly save memory if a broker has a lot of 
> partitions on it.





[jira] [Updated] (KAFKA-3300) Calculate the initial size allocation of offset index files and reduce the memory footprint for memory mapped files.

2016-02-28 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiangjie Qin updated KAFKA-3300:

Fix Version/s: 0.10.0.0

> Calculate the initial size allocation of offset index files and reduce the 
> memory footprint for memory mapped files.
> 
>
> Key: KAFKA-3300
> URL: https://issues.apache.org/jira/browse/KAFKA-3300
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.9.0.1
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> Currently the initial/max size of offset index file is configured by 
> {{log.index.max.bytes}}. This will be the offset index file size for active 
> log segment until it rolls out. 
> Theoretically, we can calculate the upper bound of offset index size using 
> the following formula:
> {noformat}
> log.segment.bytes / index.interval.bytes * 8
> {noformat}
> With default setting the bytes needed for an offset index size is 1GB / 4K * 
> 8 = 2MB. And the default log.index.max.bytes is 10MB.
> This means we are over-allocating at least 8MB on disk and mapping it to 
> memory.
> We can probably do the following:
> 1. When creating a new offset index, calculate the size using the above 
> formula,
> 2. If the result in (1) is greater than log.index.max.bytes, we allocate 
> log.index.max.bytes instead.
> This should be able to significantly save memory if a broker has a lot of 
> partitions on it.





[jira] [Updated] (KAFKA-3300) Calculate the initial/max size of offset index files and reduce the memory footprint for memory mapped index files.

2016-02-28 Thread Jiangjie Qin (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiangjie Qin updated KAFKA-3300:

Summary: Calculate the initial/max size of offset index files and reduce 
the memory footprint for memory mapped index files.  (was: Calculate the 
initial size allocation of offset index files and reduce the memory footprint 
for memory mapped files.)

> Calculate the initial/max size of offset index files and reduce the memory 
> footprint for memory mapped index files.
> ---
>
> Key: KAFKA-3300
> URL: https://issues.apache.org/jira/browse/KAFKA-3300
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.9.0.1
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> Currently the initial/max size of offset index file is configured by 
> {{log.index.max.bytes}}. This will be the offset index file size for active 
> log segment until it rolls out. 
> Theoretically, we can calculate the upper bound of offset index size using 
> the following formula:
> {noformat}
> log.segment.bytes / index.interval.bytes * 8
> {noformat}
> With default setting the bytes needed for an offset index size is 1GB / 4K * 
> 8 = 2MB. And the default log.index.max.bytes is 10MB.
> This means we are over-allocating at least 8MB on disk and mapping it to 
> memory.
> We can probably do the following:
> 1. When creating a new offset index, calculate the size using the above 
> formula,
> 2. If the result in (1) is greater than log.index.max.bytes, we allocate 
> log.index.max.bytes instead.
> This should be able to significantly save memory if a broker has a lot of 
> partitions on it.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-02-28 Thread Michal Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171252#comment-15171252
 ] 

Michal Harish commented on KAFKA-2729:
--

Hit this on Kafka 0.8.2.2 as well

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers, and both brokers 
> only recovered after a restart. Our own investigation yielded nothing; I was 
> hoping you could shed some light on this issue, and on whether it might be 
> related to https://issues.apache.org/jira/browse/KAFKA-1382, although we're 
> using 0.8.2.1.





[jira] [Created] (KAFKA-3301) CommonClientConfigs.METRICS_SAMPLE_WINDOW_MS_DOC is incorrect

2016-02-28 Thread Jun Rao (JIRA)
Jun Rao created KAFKA-3301:
--

 Summary: CommonClientConfigs.METRICS_SAMPLE_WINDOW_MS_DOC  is 
incorrect
 Key: KAFKA-3301
 URL: https://issues.apache.org/jira/browse/KAFKA-3301
 Project: Kafka
  Issue Type: Improvement
Reporter: Jun Rao
 Fix For: 0.10.0.0


The text says "The number of samples maintained to compute metrics.", which is 
incorrect.





[GitHub] kafka pull request: KAFKA-3300: Avoid over allocating disk space a...

2016-02-28 Thread becketqin
GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/983

KAFKA-3300: Avoid over allocating disk space and memory for index files.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-3300

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/983.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #983


commit b49a9af4c19513e458ced92ef49504f7a1c237df
Author: Jiangjie Qin 
Date:   2016-02-29T01:39:18Z

KAFKA-3300: Avoid over allocating disk space and memory for index files.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (KAFKA-3300) Calculate the initial/max size of offset index files and reduce the memory footprint for memory mapped index files.

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171293#comment-15171293
 ] 

ASF GitHub Bot commented on KAFKA-3300:
---

GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/983

KAFKA-3300: Avoid over allocating disk space and memory for index files.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-3300

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/983.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #983


commit b49a9af4c19513e458ced92ef49504f7a1c237df
Author: Jiangjie Qin 
Date:   2016-02-29T01:39:18Z

KAFKA-3300: Avoid over allocating disk space and memory for index files.




> Calculate the initial/max size of offset index files and reduce the memory 
> footprint for memory mapped index files.
> ---
>
> Key: KAFKA-3300
> URL: https://issues.apache.org/jira/browse/KAFKA-3300
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.9.0.1
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> Currently the initial/max size of offset index file is configured by 
> {{log.index.max.bytes}}. This will be the offset index file size for active 
> log segment until it rolls out. 
> Theoretically, we can calculate the upper bound of offset index size using 
> the following formula:
> {noformat}
> log.segment.bytes / index.interval.bytes * 8
> {noformat}
> With default setting the bytes needed for an offset index size is 1GB / 4K * 
> 8 = 2MB. And the default log.index.max.bytes is 10MB.
> This means we are over-allocating at least 8MB on disk and mapping it to 
> memory.
> We can probably do the following:
> 1. When creating a new offset index, calculate the size using the above 
> formula,
> 2. If the result in (1) is greater than log.index.max.bytes, we allocate 
> log.index.max.bytes instead.
> This should be able to significantly save memory if a broker has a lot of 
> partitions on it.


