[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1641#comment-1641 ]

Mostafa Mokhtar commented on KUDU-2086:
---------------------------------------

A higher number of reactor threads and reduced tcmalloc contention in the reactor thread code path alleviated the issue.

> Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
> ------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: krpc_hash_test.c
>
> Uneven assignment of connections to Reactor threads causes a couple of reactor threads to run at 100% CPU, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some threads still run much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc | grep -v "00:00" | awk '{print $4,$0}' | sort
> 00:03:17 69387 69596 ?  00:03:17 rpc reactor-695
> 00:03:20 69387 69632 ?  00:03:20 rpc reactor-696
> 00:03:21 69387 69607 ?  00:03:21 rpc reactor-696
> 00:03:25 69387 69629 ?  00:03:25 rpc reactor-696
> 00:03:26 69387 69594 ?  00:03:26 rpc reactor-695
> 00:03:34 69387 69595 ?  00:03:34 rpc reactor-695
> 00:03:35 69387 69625 ?  00:03:35 rpc reactor-696
> 00:03:38 69387 69570 ?  00:03:38 rpc reactor-695
> 00:03:38 69387 69620 ?  00:03:38 rpc reactor-696
> 00:03:47 69387 69639 ?  00:03:47 rpc reactor-696
> 00:03:48 69387 69593 ?  00:03:48 rpc reactor-695
> 00:03:49 69387 69591 ?  00:03:49 rpc reactor-695
> 00:04:04 69387 69600 ?  00:04:04 rpc reactor-696
> 00:07:16 69387 69640 ?  00:07:16 rpc reactor-696
> 00:07:39 69387 69616 ?  00:07:39 rpc reactor-696
> 00:07:54 69387 69572 ?  00:07:54 rpc reactor-695
> 00:09:10 69387 69613 ?  00:09:10 rpc reactor-696
> 00:09:28 69387 69567 ?  00:09:28 rpc reactor-695
> 00:09:39 69387 69603 ?  00:09:39 rpc reactor-696
> 00:09:42 69387 69641 ?  00:09:42 rpc reactor-696
> 00:09:59 69387 69604 ?  00:09:59 rpc reactor-696
> 00:10:06 69387 69623 ?  00:10:06 rpc reactor-696
> 00:10:43 69387 69636 ?  00:10:43 rpc reactor-696
> 00:10:59 69387 69642 ?  00:10:59 rpc reactor-696
> 00:11:28 69387 69585 ?  00:11:28 rpc reactor-695
> 00:12:43 69387 69598 ?  00:12:43 rpc reactor-695
> 00:15:42 69387 69578 ?  00:15:42 rpc reactor-695
> 00:16:10 69387 69614 ?  00:16:10 rpc reactor-696
> 00:17:43 69387 69575 ?  00:17:43 rpc reactor-695
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (KUDU-2355) Uneven assignment of work across
[ https://issues.apache.org/jira/browse/KUDU-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated KUDU-2355:
----------------------------------
    Summary: Uneven assignment of work across  (was: Uneven assignment o)

> Uneven assignment of work across
> --------------------------------
>
>                 Key: KUDU-2355
>                 URL: https://issues.apache.org/jira/browse/KUDU-2355
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>
> While running inserts into a Kudu table I noticed an uneven assignment of work across maintenance threads; this limits the scalability of the maintenance manager, since adding more threads won't always mean better throughput.
> ||Thread name||Cumulative User CPU (s)||Cumulative Kernel CPU (s)||Cumulative IO-wait (s)||
> |MaintenanceMgr [worker]-59102|3410.94|352.07|866.71|
> |MaintenanceMgr [worker]-59101|3056.77|319.32|794.1|
> |MaintenanceMgr [worker]-59100|2924.29|300.35|831.66|
> |MaintenanceMgr [worker]-59099|3021.22|307.3|783.84|
> |MaintenanceMgr [worker]-59098|2174.47|216.55|716|
> |MaintenanceMgr [worker]-59097|3240.47|335.55|846.99|
> |MaintenanceMgr [worker]-59096|2206.57|218.63|752.62|
> |MaintenanceMgr [worker]-59095|2112.76|210.93|720.67|
> Snapshot from top:
> {code}
>   PID USER PR NI  VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
> 59102 kudu 20  0 25.3g 15g 11m R 105.0  6.1 64:42.20 MaintenanceMgr
> 59097 kudu 20  0 25.3g 15g 11m R 101.7  6.1 61:15.90 MaintenanceMgr
> 59096 kudu 20  0 25.3g 15g 11m R  98.4  6.1 42:12.19 MaintenanceMgr
> 59098 kudu 20  0 25.3g 15g 11m R  98.4  6.1 41:53.22 MaintenanceMgr
> 59100 kudu 20  0 25.3g 15g 11m D  36.1  6.1 55:49.99 MaintenanceMgr
> 59095 kudu 20  0 25.3g 15g 11m D  29.5  6.1 40:34.79 MaintenanceMgr
> 59099 kudu 20  0 25.3g 15g 11m D   0.0  6.1 57:28.81 MaintenanceMgr
> 59101 kudu 20  0 25.3g 15g 11m D   0.0  6.1 58:03.63 MaintenanceMgr
> {code}
> This was found using:
> kudu 1.8.0-SNAPSHOT (rev e70c5ee0d6d598ba53d002ebfc7f81bb2ceda404)
> server uuid bd4cc6fdd79d4ebc8a4def27004b011d

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (KUDU-2355) Uneven assignment of work across Maintenance Mgr threads
[ https://issues.apache.org/jira/browse/KUDU-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated KUDU-2355:
----------------------------------
    Summary: Uneven assignment of work across Maintenance Mgr threads  (was: Uneven assignment of work across )

> Uneven assignment of work across Maintenance Mgr threads
> ---------------------------------------------------------
>
>                 Key: KUDU-2355
>                 URL: https://issues.apache.org/jira/browse/KUDU-2355
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>
> While running inserts into a Kudu table I noticed an uneven assignment of work across maintenance threads; this limits the scalability of the maintenance manager, since adding more threads won't always mean better throughput.
> ||Thread name||Cumulative User CPU (s)||Cumulative Kernel CPU (s)||Cumulative IO-wait (s)||
> |MaintenanceMgr [worker]-59102|3410.94|352.07|866.71|
> |MaintenanceMgr [worker]-59101|3056.77|319.32|794.1|
> |MaintenanceMgr [worker]-59100|2924.29|300.35|831.66|
> |MaintenanceMgr [worker]-59099|3021.22|307.3|783.84|
> |MaintenanceMgr [worker]-59098|2174.47|216.55|716|
> |MaintenanceMgr [worker]-59097|3240.47|335.55|846.99|
> |MaintenanceMgr [worker]-59096|2206.57|218.63|752.62|
> |MaintenanceMgr [worker]-59095|2112.76|210.93|720.67|
> Snapshot from top:
> {code}
>   PID USER PR NI  VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
> 59102 kudu 20  0 25.3g 15g 11m R 105.0  6.1 64:42.20 MaintenanceMgr
> 59097 kudu 20  0 25.3g 15g 11m R 101.7  6.1 61:15.90 MaintenanceMgr
> 59096 kudu 20  0 25.3g 15g 11m R  98.4  6.1 42:12.19 MaintenanceMgr
> 59098 kudu 20  0 25.3g 15g 11m R  98.4  6.1 41:53.22 MaintenanceMgr
> 59100 kudu 20  0 25.3g 15g 11m D  36.1  6.1 55:49.99 MaintenanceMgr
> 59095 kudu 20  0 25.3g 15g 11m D  29.5  6.1 40:34.79 MaintenanceMgr
> 59099 kudu 20  0 25.3g 15g 11m D   0.0  6.1 57:28.81 MaintenanceMgr
> 59101 kudu 20  0 25.3g 15g 11m D   0.0  6.1 58:03.63 MaintenanceMgr
> {code}
> This was found using:
> kudu 1.8.0-SNAPSHOT (rev e70c5ee0d6d598ba53d002ebfc7f81bb2ceda404)
> server uuid bd4cc6fdd79d4ebc8a4def27004b011d

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (KUDU-2355) Uneven assignment o
Mostafa Mokhtar created KUDU-2355:
-------------------------------------

             Summary: Uneven assignment o
                 Key: KUDU-2355
                 URL: https://issues.apache.org/jira/browse/KUDU-2355
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Mostafa Mokhtar

While running inserts into a Kudu table I noticed an uneven assignment of work across maintenance threads; this limits the scalability of the maintenance manager, since adding more threads won't always mean better throughput.

||Thread name||Cumulative User CPU (s)||Cumulative Kernel CPU (s)||Cumulative IO-wait (s)||
|MaintenanceMgr [worker]-59102|3410.94|352.07|866.71|
|MaintenanceMgr [worker]-59101|3056.77|319.32|794.1|
|MaintenanceMgr [worker]-59100|2924.29|300.35|831.66|
|MaintenanceMgr [worker]-59099|3021.22|307.3|783.84|
|MaintenanceMgr [worker]-59098|2174.47|216.55|716|
|MaintenanceMgr [worker]-59097|3240.47|335.55|846.99|
|MaintenanceMgr [worker]-59096|2206.57|218.63|752.62|
|MaintenanceMgr [worker]-59095|2112.76|210.93|720.67|

Snapshot from top:
{code}
  PID USER PR NI  VIRT RES SHR S  %CPU %MEM    TIME+ COMMAND
59102 kudu 20  0 25.3g 15g 11m R 105.0  6.1 64:42.20 MaintenanceMgr
59097 kudu 20  0 25.3g 15g 11m R 101.7  6.1 61:15.90 MaintenanceMgr
59096 kudu 20  0 25.3g 15g 11m R  98.4  6.1 42:12.19 MaintenanceMgr
59098 kudu 20  0 25.3g 15g 11m R  98.4  6.1 41:53.22 MaintenanceMgr
59100 kudu 20  0 25.3g 15g 11m D  36.1  6.1 55:49.99 MaintenanceMgr
59095 kudu 20  0 25.3g 15g 11m D  29.5  6.1 40:34.79 MaintenanceMgr
59099 kudu 20  0 25.3g 15g 11m D   0.0  6.1 57:28.81 MaintenanceMgr
59101 kudu 20  0 25.3g 15g 11m D   0.0  6.1 58:03.63 MaintenanceMgr
{code}

This was found using:
kudu 1.8.0-SNAPSHOT (rev e70c5ee0d6d598ba53d002ebfc7f81bb2ceda404)
server uuid bd4cc6fdd79d4ebc8a4def27004b011d

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
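[Editor's note] The per-thread CPU table above can be reproduced from /proc. The sketch below is one hedged way to do it (it is not how the reporter gathered the numbers): it reads utime/stime for every task of a process from /proc/<pid>/task/<tid>/stat, with the field layout taken from proc(5).

{code}
// Sketch: dump cumulative user/system CPU per thread of a given pid.
// The comm field in /proc/*/stat is parenthesized and may contain spaces
// (e.g. "MaintenanceMgr [worker]"), so parse from the last ')'.
#include <dirent.h>
#include <unistd.h>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

void PrintPerThreadCpu(int pid) {
  const double ticks = static_cast<double>(sysconf(_SC_CLK_TCK));
  const std::string task_dir = "/proc/" + std::to_string(pid) + "/task";
  DIR* dir = opendir(task_dir.c_str());
  if (dir == nullptr) return;
  while (dirent* de = readdir(dir)) {
    const std::string tid = de->d_name;
    if (tid == "." || tid == "..") continue;
    std::ifstream f(task_dir + "/" + tid + "/stat");
    std::string line;
    if (!std::getline(f, line)) continue;
    const size_t lparen = line.find('(');
    const size_t rparen = line.rfind(')');
    if (lparen == std::string::npos || rparen == std::string::npos) continue;
    const std::string comm = line.substr(lparen + 1, rparen - lparen - 1);
    std::istringstream rest(line.substr(rparen + 2));  // starts at field 3 (state)
    std::string field;
    unsigned long utime = 0, stime = 0;
    // utime and stime are fields 14 and 15 of /proc/*/stat (see proc(5)).
    for (int i = 3; i <= 15 && (rest >> field); ++i) {
      if (i == 14) utime = std::stoul(field);
      if (i == 15) stime = std::stoul(field);
    }
    std::cout << comm << " (tid " << tid << "): user " << utime / ticks
              << "s, sys " << stime / ticks << "s\n";
  }
  closedir(dir);
}
{code}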
[jira] [Updated] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
[ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mostafa Mokhtar updated KUDU-2342:
----------------------------------
    Attachment: Impala query profile.txt

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
> ------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: scalability
>         Attachments: Impala query profile.txt
>
> While loading 30 TB of TPC-H data onto a 129-node cluster via Impala, the write operation failed with:
> Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
Mostafa Mokhtar created KUDU-2342:
-------------------------------------

             Summary: Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
                 Key: KUDU-2342
                 URL: https://issues.apache.org/jira/browse/KUDU-2342
             Project: Kudu
          Issue Type: Bug
          Components: tablet
    Affects Versions: 1.7.0
            Reporter: Mostafa Mokhtar

While loading 30 TB of TPC-H data onto a 129-node cluster via Impala, the write operation failed with:

Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2192) KRPC should have a timer to close stuck connections
[ https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336750#comment-16336750 ]

Mostafa Mokhtar commented on KUDU-2192:
---------------------------------------

[~kwho] Queries fail in a similar way when network partitioning is introduced after the query has been running for a minute.
{code}
Status: TransmitData() to 10.00.000.29:27000 failed: Network error: recv error: Connection timed out (error 110)
{code}

> KRPC should have a timer to close stuck connections
> ----------------------------------------------------
>
>                 Key: KUDU-2192
>                 URL: https://issues.apache.org/jira/browse/KUDU-2192
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Ho
>            Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs to that host will be stuck if no timeout is specified. While RPCs which have finished sending their payloads, or which haven't started sending them, can be cancelled quickly, those in mid-transmission (i.e. an RPC at the front of the outbound queue with part of its payload already sent) cannot be cancelled until the payload has been completely sent. Therefore, it's beneficial to have a timeout that kills a connection if it makes no progress for an extended period of time, so the RPC fails and gets unstuck. The timeout may need to be conservatively large to avoid aggressively closing connections due to transient network issues. One option is to augment the existing maintenance thread logic, which checks for idle connections, to also check for this kind of timeout. Please feel free to propose other alternatives (e.g. a TCP keepalive timeout) in this JIRA.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (KUDU-2192) KRPC should have a timer to close stuck connections
[ https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333051#comment-16333051 ]

Mostafa Mokhtar commented on KUDU-2192:
---------------------------------------

[~kwho] [~sailesh] [~hubert.sun]
Tried network partitioning between two backends with KRPC enabled; on 10.00.000.28:
sudo /sbin/iptables -I INPUT -s 10.00.000.29 -j DROP
The query failed with the error below within 30 minutes:
Query Status: TransmitData() to 10.00.000.28:27000 failed: Network error: recv error: Connection timed out (error 110)
Thrift failed in a similar way, but within 15 minutes.

> KRPC should have a timer to close stuck connections
> ----------------------------------------------------
>
>                 Key: KUDU-2192
>                 URL: https://issues.apache.org/jira/browse/KUDU-2192
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Ho
>            Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs to that host will be stuck if no timeout is specified. While RPCs which have finished sending their payloads, or which haven't started sending them, can be cancelled quickly, those in mid-transmission (i.e. an RPC at the front of the outbound queue with part of its payload already sent) cannot be cancelled until the payload has been completely sent. Therefore, it's beneficial to have a timeout that kills a connection if it makes no progress for an extended period of time, so the RPC fails and gets unstuck. The timeout may need to be conservatively large to avoid aggressively closing connections due to transient network issues. One option is to augment the existing maintenance thread logic, which checks for idle connections, to also check for this kind of timeout. Please feel free to propose other alternatives (e.g. a TCP keepalive timeout) in this JIRA.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
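[Editor's note] The issue description mentions a TCP keepalive timeout as one possible alternative. The sketch below shows what that would look like at the socket level on Linux; the probe intervals are illustrative values, not anything Kudu or Impala actually uses, and this is not the mechanism the JIRA ultimately adopted.

{code}
// Minimal sketch (Linux): enable TCP keepalive on a connected socket so a
// peer that vanishes mid-transfer (e.g. behind an iptables DROP rule) is
// detected and the connection errors out instead of hanging indefinitely.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

bool EnableTcpKeepalive(int fd) {
  int on = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) return false;

  int idle_secs = 60;      // start probing after 60s of inactivity (illustrative)
  int interval_secs = 10;  // probe every 10s
  int probe_count = 6;     // declare the peer dead after 6 failed probes
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) != 0) return false;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) != 0) return false;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probe_count, sizeof(probe_count)) != 0) return false;
  return true;
}
{code}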
[jira] [Commented] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
[ https://issues.apache.org/jira/browse/KUDU-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154663#comment-16154663 ]

Mostafa Mokhtar commented on KUDU-2086:
---------------------------------------

[~tlipcon] What about switching to a round-robin distribution?

> Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
> ------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2086
>                 URL: https://issues.apache.org/jira/browse/KUDU-2086
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 1.4.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
>
> Uneven assignment of connections to Reactor threads causes a couple of reactor threads to run at 100% CPU, which limits overall system throughput.
> Increasing the number of reactor threads alleviates the problem, but some threads still run much hotter than others.
> The snapshot below is from a 20-node cluster:
> {code}
> ps -T -p 69387 | grep rpc | grep -v "00:00" | awk '{print $4,$0}' | sort
> 00:03:17 69387 69596 ?  00:03:17 rpc reactor-695
> 00:03:20 69387 69632 ?  00:03:20 rpc reactor-696
> 00:03:21 69387 69607 ?  00:03:21 rpc reactor-696
> 00:03:25 69387 69629 ?  00:03:25 rpc reactor-696
> 00:03:26 69387 69594 ?  00:03:26 rpc reactor-695
> 00:03:34 69387 69595 ?  00:03:34 rpc reactor-695
> 00:03:35 69387 69625 ?  00:03:35 rpc reactor-696
> 00:03:38 69387 69570 ?  00:03:38 rpc reactor-695
> 00:03:38 69387 69620 ?  00:03:38 rpc reactor-696
> 00:03:47 69387 69639 ?  00:03:47 rpc reactor-696
> 00:03:48 69387 69593 ?  00:03:48 rpc reactor-695
> 00:03:49 69387 69591 ?  00:03:49 rpc reactor-695
> 00:04:04 69387 69600 ?  00:04:04 rpc reactor-696
> 00:07:16 69387 69640 ?  00:07:16 rpc reactor-696
> 00:07:39 69387 69616 ?  00:07:39 rpc reactor-696
> 00:07:54 69387 69572 ?  00:07:54 rpc reactor-695
> 00:09:10 69387 69613 ?  00:09:10 rpc reactor-696
> 00:09:28 69387 69567 ?  00:09:28 rpc reactor-695
> 00:09:39 69387 69603 ?  00:09:39 rpc reactor-696
> 00:09:42 69387 69641 ?  00:09:42 rpc reactor-696
> 00:09:59 69387 69604 ?  00:09:59 rpc reactor-696
> 00:10:06 69387 69623 ?  00:10:06 rpc reactor-696
> 00:10:43 69387 69636 ?  00:10:43 rpc reactor-696
> 00:10:59 69387 69642 ?  00:10:59 rpc reactor-696
> 00:11:28 69387 69585 ?  00:11:28 rpc reactor-695
> 00:12:43 69387 69598 ?  00:12:43 rpc reactor-695
> 00:15:42 69387 69578 ?  00:15:42 rpc reactor-695
> 00:16:10 69387 69614 ?  00:16:10 rpc reactor-696
> 00:17:43 69387 69575 ?  00:17:43 rpc reactor-695
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
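[Editor's note] To make the round-robin suggestion above concrete, here is a minimal sketch contrasting the two assignment policies. It is not Kudu's actual Messenger/Reactor code; the ReactorPool class and pick_reactor_* names are hypothetical, and the point is only that hashing on the remote endpoint can collide onto a few hot reactors while an atomic counter spreads new connections evenly.

{code}
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Reactor { /* owns an event loop and the connections assigned to it */ };

class ReactorPool {
 public:
  explicit ReactorPool(size_t n) : reactors_(n) {}

  // Hash-based assignment: connections to the same remote host:port always
  // land on the same reactor, so a skewed hash or a few hot peers can pin
  // one or two reactor threads at 100% CPU.
  Reactor* pick_reactor_by_hash(const std::string& remote_hostport) {
    size_t idx = std::hash<std::string>{}(remote_hostport) % reactors_.size();
    return &reactors_[idx];
  }

  // Round-robin assignment: each new connection goes to the next reactor,
  // spreading load regardless of which peers are hot.
  Reactor* pick_reactor_round_robin() {
    uint64_t n = next_.fetch_add(1, std::memory_order_relaxed);
    return &reactors_[n % reactors_.size()];
  }

 private:
  std::vector<Reactor> reactors_;
  std::atomic<uint64_t> next_{0};
};
{code}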
[jira] [Created] (KUDU-2086) Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
Mostafa Mokhtar created KUDU-2086:
-------------------------------------

             Summary: Uneven assignment of connections to Reactor threads creates skew and limits transfer throughput
                 Key: KUDU-2086
                 URL: https://issues.apache.org/jira/browse/KUDU-2086
             Project: Kudu
          Issue Type: Bug
          Components: rpc
    Affects Versions: 1.4.0
            Reporter: Mostafa Mokhtar

Uneven assignment of connections to Reactor threads causes a couple of reactor threads to run at 100% CPU, which limits overall system throughput.
Increasing the number of reactor threads alleviates the problem, but some threads still run much hotter than others.
The snapshot below is from a 20-node cluster:
{code}
ps -T -p 69387 | grep rpc | grep -v "00:00" | awk '{print $4,$0}' | sort
00:03:17 69387 69596 ?  00:03:17 rpc reactor-695
00:03:20 69387 69632 ?  00:03:20 rpc reactor-696
00:03:21 69387 69607 ?  00:03:21 rpc reactor-696
00:03:25 69387 69629 ?  00:03:25 rpc reactor-696
00:03:26 69387 69594 ?  00:03:26 rpc reactor-695
00:03:34 69387 69595 ?  00:03:34 rpc reactor-695
00:03:35 69387 69625 ?  00:03:35 rpc reactor-696
00:03:38 69387 69570 ?  00:03:38 rpc reactor-695
00:03:38 69387 69620 ?  00:03:38 rpc reactor-696
00:03:47 69387 69639 ?  00:03:47 rpc reactor-696
00:03:48 69387 69593 ?  00:03:48 rpc reactor-695
00:03:49 69387 69591 ?  00:03:49 rpc reactor-695
00:04:04 69387 69600 ?  00:04:04 rpc reactor-696
00:07:16 69387 69640 ?  00:07:16 rpc reactor-696
00:07:39 69387 69616 ?  00:07:39 rpc reactor-696
00:07:54 69387 69572 ?  00:07:54 rpc reactor-695
00:09:10 69387 69613 ?  00:09:10 rpc reactor-696
00:09:28 69387 69567 ?  00:09:28 rpc reactor-695
00:09:39 69387 69603 ?  00:09:39 rpc reactor-696
00:09:42 69387 69641 ?  00:09:42 rpc reactor-696
00:09:59 69387 69604 ?  00:09:59 rpc reactor-696
00:10:06 69387 69623 ?  00:10:06 rpc reactor-696
00:10:43 69387 69636 ?  00:10:43 rpc reactor-696
00:10:59 69387 69642 ?  00:10:59 rpc reactor-696
00:11:28 69387 69585 ?  00:11:28 rpc reactor-695
00:12:43 69387 69598 ?  00:12:43 rpc reactor-695
00:15:42 69387 69578 ?  00:15:42 rpc reactor-695
00:16:10 69387 69614 ?  00:16:10 rpc reactor-696
00:17:43 69387 69575 ?  00:17:43 rpc reactor-695
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KUDU-1447) Document recommendation to disable THP
[ https://issues.apache.org/jira/browse/KUDU-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1584#comment-1584 ]

Mostafa Mokhtar commented on KUDU-1447:
---------------------------------------

[~tlipcon] Recommendations in CM are easily overlooked; it would help if Kudu reported warnings in its logs until madv_nohugepage is used.

> Document recommendation to disable THP
> --------------------------------------
>
>                 Key: KUDU-1447
>                 URL: https://issues.apache.org/jira/browse/KUDU-1447
>             Project: Kudu
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Doing a bunch of cluster testing, I finally got to the root of why threads sometimes take several seconds to start up, causing various timeout issues, false elections, etc. It turns out that khugepaged does synchronous page compaction while holding a process's mmap semaphore, and when that's concurrent with lots of IO, it can block for several seconds.
> https://lkml.org/lkml/2011/7/26/103
> To avoid this, we should tell users to set hugepages to "madvise" or "never" -- it's not sufficient to just disable defrag, because khugepaged still runs in the background in that case and causes this sporadic issue.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
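[Editor's note] The comment asks for a log warning when THP is left enabled. A minimal sketch of such a startup check is below; it is illustrative only, not Kudu's implementation. It reads the standard THP sysfs files, where the active mode is shown in brackets (e.g. "always madvise [never]"), and warns if "always" is selected.

{code}
#include <fstream>
#include <iostream>
#include <string>

// Returns true if the given THP sysfs file exists and its active mode is "always".
static bool ThpSetToAlways(const std::string& path) {
  std::ifstream f(path);
  std::string line;
  if (!std::getline(f, line)) return false;  // file missing: THP not supported here
  return line.find("[always]") != std::string::npos;
}

void WarnIfTransparentHugepagesEnabled() {
  const char* kEnabled = "/sys/kernel/mm/transparent_hugepage/enabled";
  const char* kDefrag  = "/sys/kernel/mm/transparent_hugepage/defrag";
  if (ThpSetToAlways(kEnabled) || ThpSetToAlways(kDefrag)) {
    std::cerr << "WARNING: transparent hugepages are set to 'always'; "
              << "set them to 'madvise' or 'never' to avoid multi-second "
              << "thread startup stalls caused by khugepaged compaction.\n";
  }
}
{code}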
[jira] [Created] (KUDU-1754) Columns should default to NULL opposed to NOT NULL
Mostafa Mokhtar created KUDU-1754:
-------------------------------------

             Summary: Columns should default to NULL opposed to NOT NULL
                 Key: KUDU-1754
                 URL: https://issues.apache.org/jira/browse/KUDU-1754
             Project: Kudu
          Issue Type: Bug
          Components: api
    Affects Versions: 1.2.0
            Reporter: Mostafa Mokhtar

Columns default to "NOT NULL" if the nullability field is not specified. This is the opposite of the default in Oracle, Teradata, MS SQL Server, MySQL, and others.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
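[Editor's note] For context, here is a hedged sketch of how nullability is specified through the Kudu C++ client's schema builder (kudu/client/schema.h); check the client headers for the exact API of your version. The point the JIRA makes is that a column ends up NOT NULL unless Nullable() is requested explicitly.

{code}
#include <kudu/client/schema.h>

kudu::Status BuildExampleSchema(kudu::client::KuduSchema* schema) {
  kudu::client::KuduSchemaBuilder b;
  b.AddColumn("id")->Type(kudu::client::KuduColumnSchema::INT64)
      ->NotNull()->PrimaryKey();
  // Without an explicit Nullable(), this column would be created NOT NULL,
  // which is the default behavior this JIRA argues should be reversed.
  b.AddColumn("comment")->Type(kudu::client::KuduColumnSchema::STRING)
      ->Nullable();
  return b.Build(schema);
}
{code}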
[jira] [Commented] (KUDU-1366) Consider switching to jemalloc
[ https://issues.apache.org/jira/browse/KUDU-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494959#comment-15494959 ]

Mostafa Mokhtar commented on KUDU-1366:
---------------------------------------

[~tarmstrong] FYI

> Consider switching to jemalloc
> ------------------------------
>
>                 Key: KUDU-1366
>                 URL: https://issues.apache.org/jira/browse/KUDU-1366
>             Project: Kudu
>          Issue Type: Bug
>          Components: build
>            Reporter: Todd Lipcon
>         Attachments: Kudu Benchmarks.pdf, Kudu Benchmarks.pdf
>
> We spend a fair amount of time in the allocator. While we could spend some time trying to use arenas more, it's also worth considering switching allocators. I ran a few quick tests with jemalloc 4.1 and it seems like it might be better than the version of tcmalloc that we use (and has much more active development).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)