[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-02-03 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851905#comment-15851905
 ] 

Ismael Juma commented on KAFKA-4614:


Thanks [~thetaphi]. We are aware of the issue and we are tracking it in 
KAFKA-4501. You will notice that we have a link to the Lucene fix there 
already. :)

> Long GC pause harming broker performance which is caused by mmap objects 
> created for OffsetIndex
> 
>
> Key: KAFKA-4614
> URL: https://issues.apache.org/jira/browse/KAFKA-4614
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 0.10.0.1
>Reporter: Yuto Kawamura
>Assignee: Yuto Kawamura
>  Labels: latency, performance
> Fix For: 0.10.2.0
>
> Attachments: kafka-produce-99th.png
>
>
> First, let me clarify our system environment, as I think it's important for 
> understanding this issue:
> OS: CentOS6
> Kernel version: 2.6.32-XX
> Filesystem used for data volume: XFS
> Java version: 1.8.0_66
> GC option: Kafka default (G1GC)
> Kafka version: 0.10.0.1
> h2. Phenomenon
> In our Kafka cluster, the usual response time for a Produce request ranges from 
> about 1ms at the 50th percentile to 10ms at the 99th percentile. All topics are 
> configured to have 3 replicas and all producers are configured with {{acks=all}}, 
> so this time includes replication latency.
> Sometimes we observe the 99th percentile latency increase to 100ms ~ 500ms, but 
> in most cases the time-consuming part is "Remote", which means it is caused by 
> slow replication; that is known to happen for various reasons (and is also 
> something we're trying to improve, but it is out of scope for this issue).
> However, we found a different pattern which happens rarely but consistently, 
> 3 ~ 5 times a day on each server. The interesting part is that "RequestQueue" 
> also increased, along with "Total" and "Remote".
> At the same time, we observed that the disk read metrics (in terms of both read 
> bytes and read time) spike at exactly the same moment. Currently we have only 
> caught-up consumers, so this metric normally sticks to zero because all requests 
> are served from the page cache.
> In order to investigate what Kafka is actually reading, I employed SystemTap and 
> wrote the following script. It traces all disk reads (only actual reads issued to 
> the physical device) made against the data volume by the broker process.
> {code}
> global target_pid = KAFKA_PID
> global target_dev = DATA_VOLUME
> probe ioblock.request {
>   if (rw == BIO_READ && pid() == target_pid && devname == target_dev) {
>  t_ms = gettimeofday_ms() + 9 * 3600 * 1000 // timezone adjustment
>  printf("%s,%03d:  tid = %d, device = %s, inode = %d, size = %d\n", 
> ctime(t_ms / 1000), t_ms % 1000, tid(), devname, ino, size)
>  print_backtrace()
>  print_ubacktrace()
>   }
> }
> {code}
> As a result, we observed many log entries like the one below:
> {code}
> Thu Dec 22 17:21:39 2016,209:  tid = 126123, device = sdb1, inode = -1, size 
> = 4096
>  0x81275050 : generic_make_request+0x0/0x5a0 [kernel]
>  0x81275660 : submit_bio+0x70/0x120 [kernel]
>  0xa036bcaa : _xfs_buf_ioapply+0x16a/0x200 [xfs]
>  0xa036d95f : xfs_buf_iorequest+0x4f/0xe0 [xfs]
>  0xa036db46 : _xfs_buf_read+0x36/0x60 [xfs]
>  0xa036dc1b : xfs_buf_read+0xab/0x100 [xfs]
>  0xa0363477 : xfs_trans_read_buf+0x1f7/0x410 [xfs]
>  0xa033014e : xfs_btree_read_buf_block+0x5e/0xd0 [xfs]
>  0xa0330854 : xfs_btree_lookup_get_block+0x84/0xf0 [xfs]
>  0xa0330edf : xfs_btree_lookup+0xbf/0x470 [xfs]
>  0xa032456f : xfs_bmbt_lookup_eq+0x1f/0x30 [xfs]
>  0xa032628b : xfs_bmap_del_extent+0x12b/0xac0 [xfs]
>  0xa0326f34 : xfs_bunmapi+0x314/0x850 [xfs]
>  0xa034ad79 : xfs_itruncate_extents+0xe9/0x280 [xfs]
>  0xa0366de5 : xfs_inactive+0x2f5/0x450 [xfs]
>  0xa0374620 : xfs_fs_clear_inode+0xa0/0xd0 [xfs]
>  0x811affbc : clear_inode+0xac/0x140 [kernel]
>  0x811b0776 : generic_delete_inode+0x196/0x1d0 [kernel]
>  0x811b0815 : generic_drop_inode+0x65/0x80 [kernel]
>  0x811af662 : iput+0x62/0x70 [kernel]
>  0x37ff2e5347 : munmap+0x7/0x30 [/lib64/libc-2.12.so]
>  0x7ff169ba5d47 : Java_sun_nio_ch_FileChannelImpl_unmap0+0x17/0x50 
> [/usr/jdk1.8.0_66/jre/lib/amd64/libnio.so]
>  0x7ff269a1307e
> {code}
> We took a jstack dump of the broker process and found that tid = 126123 
> corresponds to the thread which is for GC (hex(tid) == nid (Native Thread ID)):
> {code}
> $ grep 0x1ecab /tmp/jstack.dump
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x7ff278d0c800 
> nid=0x1ecab in Object.wait() [0x7ff17da11000]
> {code}
> In order to confirm, we enabled the {{PrintGCApplicationStoppedTime}} switch and 
> confirmed that in total the time the broker paused is longer than usual, when a 
> broker …
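The correspondence asserted just above (that the SystemTap tid and the jstack nid name the same OS thread) can be checked by converting the decimal tid to hex. The tiny Java sketch below is only an illustration added for readers and is not part of the original report:

{code}
public class TidToNid {
    public static void main(String[] args) {
        long tid = 126123L;  // OS thread id printed by the SystemTap probe above
        // jstack prints the native thread id as "nid=0x..." in hexadecimal
        System.out.println("nid=0x" + Long.toHexString(tid));  // prints nid=0x1ecab
    }
}
{code}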

[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-02-03 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851904#comment-15851904
 ] 

Jun Rao commented on KAFKA-4614:


[~thetaphi], thanks for the info. Much appreciated!


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-02-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851847#comment-15851847
 ] 

Uwe Schindler commented on KAFKA-4614:
--

Just to inform you: the Cleaner is no longer available in Java 9, so you need 
another hack. In cooperation with Apache Lucene, Oracle added a workaround that 
works differently for Java 9, as a temporary measure until unmapping is supported 
in Java 10. I am traveling to FOSDEM at the moment, where I will talk about this 
and meet Mark Reinhold to discuss the Java 10 solution.

The Lucene issue with a non-reflective MethodHandle approach for unmapping in 
Java 7, 8 and 9 can be found here: LUCENE-6989
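For readers following along, here is a rough sketch of the two unmapping paths under discussion: the {{sun.misc.Unsafe.invokeCleaner(ByteBuffer)}} method added in Java 9, with a fallback to the Java 8 {{DirectBuffer.cleaner().clean()}} route. It relies on JDK internals, is neither the Lucene nor the Kafka implementation, and is meant only to illustrate the idea:

{code}
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

// Illustrative only: explicit unmapping via JDK internals; behaviour varies by JDK version.
final class MmapUnmapper {
    static void unmap(MappedByteBuffer buffer) throws Throwable {
        Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
        try {
            // Java 9+: Unsafe.invokeCleaner(ByteBuffer), the workaround added for Java 9
            MethodHandle invokeCleaner = MethodHandles.lookup().findVirtual(
                    unsafeClass, "invokeCleaner",
                    MethodType.methodType(void.class, ByteBuffer.class));
            Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            invokeCleaner.invoke(theUnsafe.get(null), buffer);
        } catch (NoSuchMethodException e) {
            // Java 8: ((DirectBuffer) buffer).cleaner().clean(), done reflectively here
            Method cleanerMethod = buffer.getClass().getMethod("cleaner");
            cleanerMethod.setAccessible(true);
            Object cleaner = cleanerMethod.invoke(buffer);
            cleaner.getClass().getMethod("clean").invoke(cleaner);
        }
    }
}
{code}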


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-02-02 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15851097#comment-15851097
 ] 

Guozhang Wang commented on KAFKA-4614:
--

[~kawamuray] Thanks for sharing!!


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830330#comment-15830330
 ] 

ASF GitHub Bot commented on KAFKA-4614:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2352



[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-16 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823762#comment-15823762
 ] 

Ismael Juma commented on KAFKA-4614:


[~ewencp], sure. I assigned [~junrao] as the reviewer since the fix itself is 
trivial, but it would be good for him to take a look. I'll do a pass through 
all unfixed Core JIRAs still assigned to 0.10.2.0 towards the end of the week 
to ensure that we either make progress or drop them from the release.


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-15 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823442#comment-15823442
 ] 

Ewen Cheslack-Postava commented on KAFKA-4614:
--

[~ijuma] Seems like you started tracking work on this -- can you assign a 
reviewer (whether yourself or someone else, even non-committer) that can help 
shepherd this for 0.10.2.0 since it sounds like folks are looking to land a fix 
before the code freeze?


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-13 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822254#comment-15822254
 ] 

Guozhang Wang commented on KAFKA-4614:
--

Thanks for this report and thorough analysis [~kawamuray]! 


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-12 Thread Yuto Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820948#comment-15820948
 ] 

Yuto Kawamura commented on KAFKA-4614:
--

I applied the patch to one of our production brokers, deployed it, and observed 
it for over 24 hours.
As a result, this issue disappeared completely. As I wrote in the description, it 
was happening at least 3 ~ 4 times a day, so I think we can say the patch solved 
the problem.

[~ijuma] Filed a PR. Please review if possible.

[~huxi_2b]
Thanks for your comments.

> Kafka already offers a method named 'forceUnmap' that actually does this kind 
> of thing
This is exactly what I did ;)

> If others use MappedByteBuffer later, it will crash.
Right. I think it is a somewhat dangerous solution, but I still think it's safe 
because access to the mmap object is limited to {{AbstractIndex}} and its 
subclasses. Indeed, we need to be careful when changing things around there, 
though.

> Portability risk, although sun.misc.Cleaner is present in most of major JVM 
> vendors.
Right, but this class is already imported to support Windows, so I don't think 
this is a problem.

> System.gc()
I think this doesn't work at all. AFAIK:
- System.gc() does not guarantee that all collectable garbage is collected (the 
JVM only makes a "best effort").
- Whether GC is triggered manually or not, it blocks application threads for the 
same reason.

[~miguno] Thanks. The investigation was hard but was a great opportunity to 
learn many new things ;)
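To make the containment argument above slightly more concrete, here is a minimal, hypothetical sketch (class and method names invented; this is not Kafka's actual {{AbstractIndex}}) of keeping the mapped buffer reachable only through one guarded accessor:

{code}
import java.nio.MappedByteBuffer;

// Hypothetical sketch, not Kafka code: the buffer is only reachable through a single
// guarded accessor, so nothing can touch it after it has been force-unmapped.
abstract class GuardedIndex {
    private final MappedByteBuffer mmap;
    private volatile boolean unmapped = false;

    protected GuardedIndex(MappedByteBuffer mmap) {
        this.mmap = mmap;
    }

    // Every subclass reads the index through this accessor.
    protected final MappedByteBuffer buffer() {
        if (unmapped) {
            throw new IllegalStateException("index buffer already unmapped");
        }
        return mmap;
    }

    // Called right before the buffer is forcefully unmapped on index deletion.
    final void markUnmapped() {
        unmapped = true;
    }
}
{code}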



[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820924#comment-15820924
 ] 

ASF GitHub Bot commented on KAFKA-4614:
---

GitHub user kawamuray opened a pull request:

https://github.com/apache/kafka/pull/2352

KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause

Issue: https://issues.apache.org/jira/browse/KAFKA-4614

Fixes the problem of broker threads suffering from long GC pauses.
When the GC thread collects mmap objects that were created for index files, it 
unmaps the memory mapping, so the kernel proceeds to delete the file physically. 
This work may transparently read the file's metadata from the physical disk if it 
is not available in the cache.
This seems to happen typically when we're using G1GC, due to its strategy of 
leaving garbage around for a long time if other objects in the same region are 
still alive.
See the link for the details.
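To illustrate the mechanism described above (this sketch is not part of the PR): once the broker deletes a segment's index file, the kernel cannot finish the deletion while the mapping is still alive, so the remaining filesystem work happens whenever the buffer is finally unmapped, which by default is GC-driven. A toy Java sketch, with a made-up file path:

{code}
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Toy illustration, not Kafka code: deleting a file whose index is still mmapped
// only removes the directory entry; the inode is released when the mapping goes away.
public class DeferredDeleteDemo {
    public static void main(String[] args) throws Exception {
        File index = new File("/tmp/offset-index-demo.idx");  // made-up path
        MappedByteBuffer mmap;
        try (RandomAccessFile raf = new RandomAccessFile(index, "rw")) {
            raf.setLength(10 * 1024 * 1024);
            mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());
        }
        index.delete();  // directory entry is gone, but the blocks are still held
        // The mapping is still readable; the real release (munmap + inode cleanup)
        // only happens once the buffer is unmapped, e.g. by the GC/cleaner path.
        System.out.println("file still listed: " + index.exists()
                + ", first mapped byte: " + mmap.get(0));
    }
}
{code}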

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kawamuray/kafka 
KAFKA-4614-force-munmap-for-index

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2352


commit 8c3a4c9f5a6188c641a45cb3e111094f545f7e66
Author: Yuto Kawamura 
Date:   2017-01-11T08:10:05Z

KAFKA-4614 Forcefully unmap mmap of OffsetIndex to prevent long GC pause





[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-12 Thread Michael Noll (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15820875#comment-15820875
 ] 

Michael Noll commented on KAFKA-4614:
-

[~kawamuray] Sorry to hear that you ran into this.  Thanks a lot for the 
thorough investigation!


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-11 Thread huxi (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817849#comment-15817849
 ] 

huxi commented on KAFKA-4614:
-

Great catch on the whole thing.

As for the unmap issue, it seems there is no perfect solution at the moment. Java 
does not offer a corresponding unmap that invokes the underlying munmap system 
call for the mapped file. See [the JDK 
bug|http://bugs.java.com/view_bug.do?bug_id=4724038].

Right now there exist two "not-that-perfect" solutions:
1. Explicitly unmap the mapped file, as [~kawamuray] said, by invoking 
'DirectBuffer.cleaner().clean()' (sketched below). By the way, Kafka already 
offers a method named 'forceUnmap' that does exactly this kind of thing, so we 
could simply reuse it. However, this solution has two drawbacks: 1. if other code 
uses the MappedByteBuffer later, the JVM will crash; 2. portability risk, although 
sun.misc.Cleaner is present in most major JVM vendors.
2. Explicitly invoke 'System.gc()', which is also not a good enough idea, 
especially when -XX:+DisableExplicitGC is set.
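A minimal sketch of option 1 above, assuming a Java 8 JVM that exposes sun.nio.ch.DirectBuffer and sun.misc.Cleaner (the file path is made up; this illustrates the idea rather than Kafka's 'forceUnmap' itself):

{code}
import java.io.RandomAccessFile;
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative only: explicitly running the buffer's Cleaner so munmap happens here,
// on the calling thread, instead of later during GC reference processing.
public class ExplicitUnmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/unmap-demo.idx", "rw")) {
            raf.setLength(4096);
            MappedByteBuffer mmap =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            Object cleaner = call(mmap, "cleaner");  // sun.misc.Cleaner on Java 8
            call(cleaner, "clean");                  // the mapping is released right here
        }
    }

    private static Object call(Object target, String method) throws Exception {
        Method m = target.getClass().getMethod(method);
        m.setAccessible(true);
        return m.invoke(target);
    }
}
{code}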


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-11 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817826#comment-15817826
 ] 

Ismael Juma commented on KAFKA-4614:


[~kawamuray], thanks for reporting this. Looking forward to your results.


[jira] [Commented] (KAFKA-4614) Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex

2017-01-11 Thread Yuto Kawamura (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817627#comment-15817627
 ] 

Yuto Kawamura commented on KAFKA-4614:
--

I've tried to fix this and I'm now testing a patched broker in our environment.
I will submit a PR once it's confirmed to fix this problem.
