[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1641#comment-1641 ]

Mostafa Mokhtar commented on KUDU-2086:
---

A higher number of reactor threads and reduced tcmalloc contention in the reactor thread code path alleviated the issue.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: krpc_hash_test.c
>
>
> Uneven assignment of connections to Reactor threads causes a couple of
> reactor threads to run at 100%, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some
> threads are still running much hotter than others.
> Snapshot below is from a 20 node cluster
> {code}
> ps -T -p 69387 | grep rpc | grep -v "00:00" | awk '{print $4,$0}' | sort
> 00:03:17 69387 69596 ? 00:03:17 rpc reactor-695
> 00:03:20 69387 69632 ? 00:03:20 rpc reactor-696
> 00:03:21 69387 69607 ? 00:03:21 rpc reactor-696
> 00:03:25 69387 69629 ? 00:03:25 rpc reactor-696
> 00:03:26 69387 69594 ? 00:03:26 rpc reactor-695
> 00:03:34 69387 69595 ? 00:03:34 rpc reactor-695
> 00:03:35 69387 69625 ? 00:03:35 rpc reactor-696
> 00:03:38 69387 69570 ? 00:03:38 rpc reactor-695
> 00:03:38 69387 69620 ? 00:03:38 rpc reactor-696
> 00:03:47 69387 69639 ? 00:03:47 rpc reactor-696
> 00:03:48 69387 69593 ? 00:03:48 rpc reactor-695
> 00:03:49 69387 69591 ? 00:03:49 rpc reactor-695
> 00:04:04 69387 69600 ? 00:04:04 rpc reactor-696
> 00:07:16 69387 69640 ? 00:07:16 rpc reactor-696
> 00:07:39 69387 69616 ? 00:07:39 rpc reactor-696
> 00:07:54 69387 69572 ? 00:07:54 rpc reactor-695
> 00:09:10 69387 69613 ? 00:09:10 rpc reactor-696
> 00:09:28 69387 69567 ? 00:09:28 rpc reactor-695
> 00:09:39 69387 69603 ? 00:09:39 rpc reactor-696
> 00:09:42 69387 69641 ? 00:09:42 rpc reactor-696
> 00:09:59 69387 69604 ? 00:09:59 rpc reactor-696
> 00:10:06 69387 69623 ? 00:10:06 rpc reactor-696
> 00:10:43 69387 69636 ? 00:10:43 rpc reactor-696
> 00:10:59 69387 69642 ? 00:10:59 rpc reactor-696
> 00:11:28 69387 69585 ? 00:11:28 rpc reactor-695
> 00:12:43 69387 69598 ? 00:12:43 rpc reactor-695
> 00:15:42 69387 69578 ? 00:15:42 rpc reactor-695
> 00:16:10 69387 69614 ? 00:16:10 rpc reactor-696
> 00:17:43 69387 69575 ? 00:17:43 rpc reactor-695
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409830#comment-16409830 ]

Joe McDonnell commented on KUDU-2086:
---

[~tlipcon] Good point, I changed this to an Improvement and dropped the priority.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: krpc_hash_test.c
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408876#comment-16408876 ]

Todd Lipcon commented on KUDU-2086:
---

[~joemcdonnell] afaik this isn't really an issue anymore. Perhaps we should drop it to backburner priority and classify it as an improvement rather than a bug?

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Critical
>         Attachments: krpc_hash_test.c
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331492#comment-16331492 ]

Joe McDonnell commented on KUDU-2086:
---

[~tlipcon] Good point, this is not a blocker. I will lower the priority.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Blocker
>         Attachments: krpc_hash_test.c
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331487#comment-16331487 ]

Todd Lipcon commented on KUDU-2086:
---

[~joemcdonnell] is this still a blocker? From talking with [~mmokhtar] offline recently it sounds like some changes went into Impala that drastically reduced the load on the reactor threads to the point that it isn't a big problem anymore. Might still be worth doing this eventually but we try to reserve blocker priority for serious issues like data loss.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Blocker
>         Attachments: krpc_hash_test.c
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155855#comment-16155855 ]

Todd Lipcon commented on KUDU-2086:
---

Sure, but round robin also needs to "remember" the assignment in some kind of map. So round robin and "assign to least loaded" are probably equivalent effort to implement.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
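A minimal sketch of the two policies compared in this comment, to show why both need the same bookkeeping. Everything here is hypothetical illustration rather than Kudu code: the `ReactorAssigner` class, its method names, and the policy strings are invented, and the "sticky" lookup is one reading of what "remember the assignment" means (the same remote keeps landing on the same reactor).

```python
class ReactorAssigner:
    """Hypothetical helper mapping each remote address to a reactor index."""

    def __init__(self, num_reactors, policy):
        if policy not in ("round_robin", "least_loaded"):
            raise ValueError("unknown policy: " + policy)
        self.policy = policy
        self.loads = [0] * num_reactors  # live connections per reactor
        self.assigned = {}               # remote -> reactor index (the "map")
        self.next_rr = 0                 # cursor used only by round robin

    def assign(self, remote):
        # Sticky lookup: a remote that is already assigned keeps its reactor.
        if remote in self.assigned:
            return self.assigned[remote]
        if self.policy == "round_robin":
            idx = self.next_rr
            self.next_rr = (self.next_rr + 1) % len(self.loads)
        else:
            # Least loaded: first reactor with the fewest live connections.
            idx = min(range(len(self.loads)), key=self.loads.__getitem__)
        self.assigned[remote] = idx
        self.loads[idx] += 1
        return idx

    def release(self, remote):
        # Called when a connection closes; frees its slot on the reactor.
        idx = self.assigned.pop(remote)
        self.loads[idx] -= 1


rr = ReactorAssigner(4, "round_robin")
print([rr.assign("host%d" % i) for i in range(6)])  # [0, 1, 2, 3, 0, 1]
```

The two policies differ only in the few lines that pick `idx`; the map and the per-reactor counters are shared. Least loaded additionally refills reactors that lose connections through `release`, which round robin does not take into account.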
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154663#comment-16154663 ]

Mostafa Mokhtar commented on KUDU-2086:
---

[~tlipcon] What about switching to round robin distribution?

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154646#comment-16154646 ]

Todd Lipcon commented on KUDU-2086:
---

[~sailesh] and I chatted about this a bit this afternoon by IM. I don't think it's an issue with the hash code -- even with a "perfect" hash code (i.e. exactly random) we are likely to see skew.

The reason here is that we are defining skew as max(# connections in a reactor) / average(# connections in a reactor). The "# connections in a reactor" variable has a binomial distribution. If you sample a bunch of times from a binomial distribution and take the max over those samples, that max is likely to be much higher than the mean (see "order statistics" on Wikipedia for more details).

I ran a simple Python simulation as well:

{code}
import random
from collections import Counter

import numpy as np
import pandas as pd

num_reactors = 24
num_nodes = 100
num_trials = 5000
trial_results = []
for trial in range(num_trials):
    # Model a perfect hash: each node lands on a uniformly random reactor.
    # randrange excludes the upper bound, so indices are 0..num_reactors-1.
    assignments = [random.randrange(num_reactors) for _ in range(num_nodes)]
    # Connections per reactor (reactors with no connections are not counted).
    reactor_counts = list(Counter(assignments).values())
    worst_to_avg = max(reactor_counts) / np.average(reactor_counts)
    trial_results.append(worst_to_avg)
# Histogram of the observed skew (max/mean) across trials.
pd.Series(trial_results).hist(bins=40)
{code}

which runs a lot of simulated trials with a perfect hash function and plots the distribution of observed skew (max/mean). The resulting distribution looks like:

!https://ibin.co/3ZOmzYwLIzeq.png!

i.e. most of the time, we expect to see a skew around 2x, which more or less matches what we see experimentally in the Impala use case.

So, if we want to reduce skew, we need to do explicit assignment/balancing rather than random stateless assignment using hashes.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
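To put a number on the closing point about explicit balancing, here is a small standard-library sketch that contrasts random stateless assignment with a least-loaded policy for the same 100 connections and 24 reactors. The function names are hypothetical, and unlike the simulation in the comment the mean here is taken over all reactors, including empty ones.

```python
import random


def skew(counts):
    """Max connections on any reactor divided by the mean over all reactors."""
    return max(counts) / (sum(counts) / len(counts))


def hashed_counts(num_conns, num_reactors, rng):
    # Stand-in for an ideal hash: every connection picks a uniform reactor.
    counts = [0] * num_reactors
    for _ in range(num_conns):
        counts[rng.randrange(num_reactors)] += 1
    return counts


def least_loaded_counts(num_conns, num_reactors):
    # Explicit balancing: each connection goes to a currently emptiest reactor.
    counts = [0] * num_reactors
    for _ in range(num_conns):
        counts[counts.index(min(counts))] += 1
    return counts


rng = random.Random(0)  # fixed seed so repeated runs agree
hashed = [skew(hashed_counts(100, 24, rng)) for _ in range(2000)]
balanced = skew(least_loaded_counts(100, 24))
print("hashed:   mean skew %.2f" % (sum(hashed) / len(hashed)))
print("balanced: skew %.2f" % balanced)
```

With least-loaded assignment the busiest reactor holds 5 connections against a mean of 100/24, so the skew is 1.2, the lowest value reachable for these counts. Random assignment can never go below that and is usually well above it. Connection count is only a proxy for load, so this says nothing about reactors that host unusually busy connections.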
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147745#comment-16147745 ]

Michael Ho commented on KUDU-2086:
---

Actually, I wonder if it has to do with the endianness. Network address is usually represented as big endian so the contiguous range of IP addresses would actually differ in the most significant byte (in a 32-bit integer) when represented as little endian. Need to run some simple experiments to verify the behavior.

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
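A quick way to try the experiment suggested in this comment, sketched under loud assumptions: the "hash" is taken to be the raw 32-bit address value (this is not a claim about what Kudu's `HashCode()` computes), the 20 addresses `10.0.0.1` to `10.0.0.20` are made up, and 24 reactors is an assumed count.

```python
import ipaddress

NUM_REACTORS = 24  # assumed reactor count, for illustration only


def reactor_index(ip, byteorder):
    # packed is the 4-byte address in network (big-endian) order.
    raw = ipaddress.IPv4Address(ip).packed
    # Reinterpret those bytes as an integer in the given byte order and take
    # the modulus, treating the raw value as if it were the hash.
    return int.from_bytes(raw, byteorder) % NUM_REACTORS


hosts = ["10.0.0.%d" % k for k in range(1, 21)]  # hypothetical contiguous range
big = {reactor_index(h, "big") for h in hosts}
little = {reactor_index(h, "little") for h in hosts}
print(len(big), len(little))  # 20 3
```

Read as little endian, the addresses differ only in the top byte, so consecutive values are 2^24 apart. Since 2^24 mod 24 = 16, the indices cycle through just three reactors (2, 10 and 18), while the big-endian reading gives each of the 20 hosts its own reactor. Whether anything like this happens in practice depends on how much mixing the real hash function does.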
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143012#comment-16143012 ]

Michael Ho commented on KUDU-2086:
---

I suppose the IP address range in your case is contiguous, right? I wonder if the hash values for different IP addresses differ only in the high bits, so taking the modulus below doesn't quite spread them out.

{noformat}
uint32_t hashCode = remote.HashCode();
int reactor_idx = hashCode % reactors_.size();
{noformat}

> Uneven assignment of connections to Reactor threads creates skew and limits
> transfer throughput
> ---
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar