[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-03-22 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1641#comment-1641
 ] 

Mostafa Mokhtar commented on KUDU-2086:
---

Increasing the number of reactor threads and reducing tcmalloc contention in 
the reactor thread code path alleviated the issue.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Joe McDonnell
> Priority: Major
> Attachments: krpc_hash_test.c
>
>
> Uneven assignment of connections to Reactor threads causes a couple of 
> reactor threads to run at 100%, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some 
> threads still run much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc |  grep -v "00:00"  | awk '{print $4,$0}' | sort
> 00:03:17  69387  69596 ? 00:03:17 rpc reactor-695
> 00:03:20  69387  69632 ? 00:03:20 rpc reactor-696
> 00:03:21  69387  69607 ? 00:03:21 rpc reactor-696
> 00:03:25  69387  69629 ? 00:03:25 rpc reactor-696
> 00:03:26  69387  69594 ? 00:03:26 rpc reactor-695
> 00:03:34  69387  69595 ? 00:03:34 rpc reactor-695
> 00:03:35  69387  69625 ? 00:03:35 rpc reactor-696
> 00:03:38  69387  69570 ? 00:03:38 rpc reactor-695
> 00:03:38  69387  69620 ? 00:03:38 rpc reactor-696
> 00:03:47  69387  69639 ? 00:03:47 rpc reactor-696
> 00:03:48  69387  69593 ? 00:03:48 rpc reactor-695
> 00:03:49  69387  69591 ? 00:03:49 rpc reactor-695
> 00:04:04  69387  69600 ? 00:04:04 rpc reactor-696
> 00:07:16  69387  69640 ? 00:07:16 rpc reactor-696
> 00:07:39  69387  69616 ? 00:07:39 rpc reactor-696
> 00:07:54  69387  69572 ? 00:07:54 rpc reactor-695
> 00:09:10  69387  69613 ? 00:09:10 rpc reactor-696
> 00:09:28  69387  69567 ? 00:09:28 rpc reactor-695
> 00:09:39  69387  69603 ? 00:09:39 rpc reactor-696
> 00:09:42  69387  69641 ? 00:09:42 rpc reactor-696
> 00:09:59  69387  69604 ? 00:09:59 rpc reactor-696
> 00:10:06  69387  69623 ? 00:10:06 rpc reactor-696
> 00:10:43  69387  69636 ? 00:10:43 rpc reactor-696
> 00:10:59  69387  69642 ? 00:10:59 rpc reactor-696
> 00:11:28  69387  69585 ? 00:11:28 rpc reactor-695
> 00:12:43  69387  69598 ? 00:12:43 rpc reactor-695
> 00:15:42  69387  69578 ? 00:15:42 rpc reactor-695
> 00:16:10  69387  69614 ? 00:16:10 rpc reactor-696
> 00:17:43  69387  69575 ? 00:17:43 rpc reactor-695
> {code}
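
One rough way to quantify the imbalance in the snapshot above is to compare the hottest thread's cumulative CPU time with the mean over the listed threads. The sketch below does that for the 29 TIME values shown. Treating cumulative CPU time as a proxy for reactor load is an assumption, and threads filtered out by `grep -v "00:00"` are not counted.

```python
# TIME column values copied from the ps snapshot above (HH:MM:SS).
cpu_times = """
00:03:17 00:03:20 00:03:21 00:03:25 00:03:26 00:03:34 00:03:35 00:03:38
00:03:38 00:03:47 00:03:48 00:03:49 00:04:04 00:07:16 00:07:39 00:07:54
00:09:10 00:09:28 00:09:39 00:09:42 00:09:59 00:10:06 00:10:43 00:10:59
00:11:28 00:12:43 00:15:42 00:16:10 00:17:43
""".split()


def to_seconds(hh_mm_ss):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, secs = (int(part) for part in hh_mm_ss.split(":"))
    return hours * 3600 + minutes * 60 + secs


seconds = [to_seconds(t) for t in cpu_times]
mean_seconds = sum(seconds) / len(seconds)
# Ratio of the hottest listed thread to the average listed thread.
skew = max(seconds) / mean_seconds

print("threads=%d max=%ds mean=%.1fs max/mean=%.2f"
      % (len(seconds), max(seconds), mean_seconds, skew))
```

A ratio well above 1 means the busiest reactor thread has done much more work than the typical one.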



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-03-22 Thread Joe McDonnell (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409830#comment-16409830
 ] 

Joe McDonnell commented on KUDU-2086:
-

[~tlipcon] Good point, I changed this to an Improvement and dropped the 
priority.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Joe McDonnell
> Priority: Major
> Attachments: krpc_hash_test.c


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-03-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408876#comment-16408876
 ] 

Todd Lipcon commented on KUDU-2086:
---

[~joemcdonnell] afaik this isn't really an issue anymore. Perhaps we should 
drop it to backburner priority and classify it as an improvement rather than a 
bug?

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Joe McDonnell
> Priority: Critical
> Attachments: krpc_hash_test.c


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-01-18 Thread Joe McDonnell (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331492#comment-16331492
 ] 

Joe McDonnell commented on KUDU-2086:
-

[~tlipcon] Good point, this is not a blocker. I will lower the priority.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Joe McDonnell
> Priority: Blocker
> Attachments: krpc_hash_test.c


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2018-01-18 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331487#comment-16331487
 ] 

Todd Lipcon commented on KUDU-2086:
---

[~joemcdonnell] is this still a blocker? From talking with [~mmokhtar] offline 
recently it sounds like some changes went into Impala that drastically reduced 
the load on the reactor threads to the point that it isn't a big problem 
anymore.

Might still be worth doing this eventually but we try to reserve blocker 
priority for serious issues like data loss.

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Joe McDonnell
> Priority: Blocker
> Attachments: krpc_hash_test.c


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2017-09-06 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155855#comment-16155855
 ] 

Todd Lipcon commented on KUDU-2086:
---

Sure, but round robin also needs to "remember" the assignment in some kind of 
map. So round robin and "assign to least loaded" are probably equivalent effort 
to implement.
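
To illustrate why round robin needs remembered state, here is a minimal sketch, assuming a hypothetical `RoundRobinAssigner` helper that is not part of the Kudu RPC code. The dictionary is what keeps a given remote on the same reactor across lookups. Thread safety and connection teardown are left out.

```python
import itertools


class RoundRobinAssigner:
    """Hypothetical helper: hands out reactor indices in round-robin order."""

    def __init__(self, num_reactors):
        self._next_reactor = itertools.cycle(range(num_reactors))
        self._assigned = {}  # remote address -> reactor index

    def reactor_for(self, remote):
        # Without this map, a second lookup for the same remote would
        # land on a different reactor.
        if remote not in self._assigned:
            self._assigned[remote] = next(self._next_reactor)
        return self._assigned[remote]


assigner = RoundRobinAssigner(num_reactors=4)
remotes = ["10.0.0.%d" % i for i in range(1, 9)]  # made-up addresses
first_pass = [assigner.reactor_for(r) for r in remotes]
second_pass = [assigner.reactor_for(r) for r in remotes]
print(first_pass)
print(second_pass)
```

A "least loaded" policy would need the same map plus a per-reactor counter, which is why the two approaches are likely similar in effort.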

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Michael Ho



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2017-09-05 Thread Mostafa Mokhtar (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154663#comment-16154663
 ] 

Mostafa Mokhtar commented on KUDU-2086:
---

[~tlipcon]

What about switching to round robin distribution?

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Michael Ho


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2017-09-05 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154646#comment-16154646
 ] 

Todd Lipcon commented on KUDU-2086:
---

[~sailesh] and I chatted about this a bit this afternoon by IM. I don't think 
it's an issue with the hash code: even with a "perfect" hash code (i.e. 
uniformly random) we are likely to see skew.

The reason is that we are defining skew as max(# connections in a reactor) 
/ average(# connections in a reactor). The "# connections in a reactor" 
variable has a binomial distribution. If you sample a bunch of times from a 
binomial distribution and take the max over those samples, that max is likely 
to be much higher than the mean (see "order statistics" on Wikipedia for more 
details).

I ran a simple Python simulation as well:

{code}
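import random
from collections import Counter

import numpy as np
import pandas as pd

num_reactors = 24
num_nodes = 100
num_trials = 5000

trial_results = []
for trial in range(num_trials):
  # Assign each node's connection to a uniformly random reactor, i.e. a
  # "perfect" hash. randrange excludes the upper bound, so indices fall in
  # [0, num_reactors).
  assignments = [random.randrange(num_reactors) for _ in range(num_nodes)]
  # Connection count per reactor that received at least one connection.
  reactor_counts = list(Counter(assignments).values())
  # Skew for this trial: busiest reactor relative to the average.
  worst_to_avg = max(reactor_counts) / np.average(reactor_counts)
  trial_results.append(worst_to_avg)

# Plot the distribution of observed skew (requires matplotlib).
pd.Series(trial_results).hist(bins=40)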
{code}

which runs a lot of simulated trials with a perfect hash function and plots the 
distribution of observed skew (max/mean). The resulting distribution looks like:

!https://ibin.co/3ZOmzYwLIzeq.png!

i.e. most of the time, we expect to see a skew around 2x, which more or less 
matches what we see experimentally in the Impala use case.

So, if we want to reduce skew, we need to do explicit assignment/balancing 
rather than random stateless assignment using hashes.
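
As a rough illustration of the difference, the sketch below compares one random assignment with an explicit "assign to least loaded" policy for the same 100 connections and 24 reactors. The function names are made up for this example, and the skew here divides by all reactors, including empty ones.

```python
import random


def random_assignment(num_nodes, num_reactors, rng):
    """Stateless-style assignment: each connection picks a random reactor."""
    counts = [0] * num_reactors
    for _ in range(num_nodes):
        counts[rng.randrange(num_reactors)] += 1
    return counts


def least_loaded_assignment(num_nodes, num_reactors):
    """Explicit balancing: each connection goes to the emptiest reactor."""
    counts = [0] * num_reactors
    for _ in range(num_nodes):
        idx = counts.index(min(counts))
        counts[idx] += 1
    return counts


def skew(counts):
    """max / mean of per-reactor connection counts (over all reactors)."""
    return max(counts) / (sum(counts) / len(counts))


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
random_skew = skew(random_assignment(100, 24, rng))
balanced_skew = skew(least_loaded_assignment(100, 24))
print("random: %.2f  least-loaded: %.2f" % (random_skew, balanced_skew))
```

With explicit balancing no reactor holds more than ceil(100 / 24) = 5 connections, so the skew is bounded at 5 / (100 / 24) = 1.2, whereas a random assignment can never do better than that.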

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Michael Ho


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2017-08-30 Thread Michael Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147745#comment-16147745
 ] 

Michael Ho commented on KUDU-2086:
--

Actually, I wonder if it has to do with endianness. Network addresses are 
usually represented in big-endian byte order, so addresses in a contiguous IP 
range would differ in the most significant byte of a 32-bit integer when 
interpreted as little endian. I need to run some simple experiments to verify 
the behavior.
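
A small experiment along these lines, using made-up addresses, shows which byte changes under each interpretation. It only demonstrates the byte-order effect and does not reproduce what Kudu's hash function actually does with the address.

```python
import socket
import struct


def as_ints(ip):
    """Return (big-endian, little-endian) readings of an IPv4 address."""
    packed = socket.inet_aton(ip)  # 4 bytes in network (big-endian) order
    big = struct.unpack(">I", packed)[0]
    little = struct.unpack("<I", packed)[0]
    return big, little


# Hypothetical contiguous range of addresses.
ips = ["10.0.0.%d" % i for i in range(1, 5)]
for ip in ips:
    big, little = as_ints(ip)
    print("%-9s big-endian=0x%08x little-endian=0x%08x" % (ip, big, little))
```

Read as big endian, neighbouring addresses differ by 1. Read as little endian, they differ by 2^24, so only the top byte changes.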

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar
> Assignee: Michael Ho


[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput

2017-08-27 Thread Michael Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143012#comment-16143012
 ] 

Michael Ho commented on KUDU-2086:
--

I suppose the IP addresses in your case are in a contiguous range, right? I 
wonder if the hash values for different IP addresses differ only in the high 
bits, so that taking the modulus below doesn't spread them out well.

{noformat}
  uint32_t hashCode = remote.HashCode();
  int reactor_idx = hashCode % reactors_.size();
{noformat}
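
The concern can be checked with a small sketch. The hash codes below are invented for illustration (20 values that differ only in the top byte, and 20 that differ only in the low byte). They are not output of the real `HashCode()`, and the reactor counts 16 and 24 are just example sizes.

```python
def buckets_used(hash_codes, num_reactors):
    """Distinct reactor indices produced by hash_code % num_reactors."""
    return sorted({h % num_reactors for h in hash_codes})


# Hypothetical hash codes that differ only in the most significant byte.
high_bit_codes = [(i << 24) | 0x0A for i in range(1, 21)]
# Hypothetical hash codes that differ only in the least significant byte.
low_bit_codes = [0x0A000000 | i for i in range(1, 21)]

for num_reactors in (16, 24):
    print("reactors=%d high-bit buckets=%s low-bit buckets used=%d"
          % (num_reactors,
             buckets_used(high_bit_codes, num_reactors),
             len(buckets_used(low_bit_codes, num_reactors))))
```

When the values differ only in the high bits, a modulus by 16 maps all of them to one reactor and a modulus by 24 to just three, while low-bit differences spread across many reactors.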

> Uneven assignment of connections to Reactor threads creates skew and limits 
> transfer throughput
> ---
>
> Key: KUDU-2086
> URL: https://issues.apache.org/jira/browse/KUDU-2086
> Project: Kudu
> Issue Type: Bug
> Components: rpc
> Affects Versions: 1.4.0
> Reporter: Mostafa Mokhtar