[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368903#comment-16368903 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1119

> Make Fragment Runner's RPC Timeout a SystemOption
> -------------------------------------------------
>
>                 Key: DRILL-6143
>                 URL: https://issues.apache.org/jira/browse/DRILL-6143
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.13.0
>
>
> Queries fail sporadically on some clusters with the following error:
> {code}
> oadd.org.apache.drill.common.exceptions.UserRemoteException: CONNECTION ERROR: Exceeded timeout (25000) while waiting send intermediate work fragments to remote nodes. Sent 5 and only heard response back from 4 nodes.
> {code}
> This error happens because the FragmentsRunner has a hardcoded timeout, RPC_WAIT_IN_MSECS_PER_FRAGMENT, which is set to 5 seconds. Increasing the timeout to 10 seconds resolved the sporadic failures that were observed. This timeout should be raised to 10 seconds and should also be configurable via the SystemOptionManager.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359170#comment-16359170 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user vrozov commented on the issue:

    https://github.com/apache/drill/pull/1119

    LGTM
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358955#comment-16358955 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1119#discussion_r167346417

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java ---
    @@ -212,7 +212,8 @@
           new OptionDefinition(ExecConstants.CPU_LOAD_AVERAGE),
           new OptionDefinition(ExecConstants.ENABLE_VECTOR_VALIDATOR),
           new OptionDefinition(ExecConstants.ENABLE_ITERATOR_VALIDATOR),
    -      new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false))
    +      new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
    +      new OptionDefinition(ExecConstants.FRAG_RUNNER_RPC_TIMEOUT_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, false, true)),
    --- End diff --

    internal should be true since we want this to show up in the internal options table and not the standard system options table. Changing to adminOnly = true seems reasonable.
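The two booleans discussed in this review comment are positional, which makes them easy to swap. A minimal Java sketch of the distinction follows, assuming the argument order (scope, adminOnly, internal) implied by the reviewer's question. `OptionFlagsSketch`, `OptionMeta`, and `listedIn` are hypothetical names for illustration and are not Drill's `OptionMetaData` API; the only behavior modeled is the one stated in the comment, that internal options are listed in the internal options table rather than the standard one.

```java
public class OptionFlagsSketch {

    /** Hypothetical stand-in for the three-argument metadata in the diff. */
    static final class OptionMeta {
        final String scope;
        final boolean adminOnly;
        final boolean internal;

        // Assumed order, taken from the review question: (scope, adminOnly, internal).
        OptionMeta(String scope, boolean adminOnly, boolean internal) {
            this.scope = scope;
            this.adminOnly = adminOnly;
            this.internal = internal;
        }
    }

    /** Internal options go to the internal options table, all others to the standard one. */
    static String listedIn(OptionMeta meta) {
        return meta.internal ? "internal options table" : "standard system options table";
    }

    public static void main(String[] args) {
        OptionMeta asSubmitted = new OptionMeta("SYSTEM", false, true); // values in the diff
        OptionMeta asQuestioned = new OptionMeta("SYSTEM", true, false); // reviewer's alternative
        OptionMeta asAgreed = new OptionMeta("SYSTEM", true, true);      // outcome of the reply

        System.out.println("submitted:  " + listedIn(asSubmitted));
        System.out.println("questioned: " + listedIn(asQuestioned));
        System.out.println("agreed:     " + listedIn(asAgreed) + ", adminOnly=" + asAgreed.adminOnly);
    }
}
```

Only the reviewer's alternative (internal = false) would move the option into the standard system options table; the submitted and agreed values both keep it internal.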
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358942#comment-16358942 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1119#discussion_r167343642

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java ---
    @@ -212,7 +212,8 @@
           new OptionDefinition(ExecConstants.CPU_LOAD_AVERAGE),
           new OptionDefinition(ExecConstants.ENABLE_VECTOR_VALIDATOR),
           new OptionDefinition(ExecConstants.ENABLE_ITERATOR_VALIDATOR),
    -      new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false))
    +      new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
    +      new OptionDefinition(ExecConstants.FRAG_RUNNER_RPC_TIMEOUT_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, false, true)),
    --- End diff --

    Question: Should the last two parameters be instead:
       adminOnly = true
       internal = false
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358913#comment-16358913 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1119#discussion_r167338469

    --- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
    @@ -413,6 +413,7 @@ drill.exec.options: {
         # to start at least 2 partitions then HashAgg fallbacks to this case. It can be
         # enabled by setting this flag to true. By default it's set to false such that
         # query will fail if there is not enough memory
    +    drill.exec.rpc.fragrunner.timeout: 3,
    --- End diff --

    Thanks for catching the ordering. I reduced the default to 1.
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358802#comment-16358802 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user vrozov commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1119#discussion_r167312527

    --- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
    @@ -413,6 +413,7 @@ drill.exec.options: {
         # to start at least 2 partitions then HashAgg fallbacks to this case. It can be
         # enabled by setting this flag to true. By default it's set to false such that
         # query will fail if there is not enough memory
    +    drill.exec.rpc.fragrunner.timeout: 3,
    --- End diff --

    The value looks a little bit high and it needs to be moved either below `drill.exec.hashagg.fallback.enabled` or above the preceding comment, otherwise LGTM.
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358752#comment-16358752 ]

ASF GitHub Bot commented on DRILL-6143:
---------------------------------------

Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1119

    @arina-ielchiieva is on vacation. @vrozov, @Ben-Zvi can you take a look?
[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption
[ https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357932#comment-16357932 ]

Timothy Farkas commented on DRILL-6143:
---------------------------------------

There seem to be cases where a drillbit does not send a response to the FragmentsRunner even with a larger timeout. There is likely another issue that can cause a response never to be sent in some cases. I will create a separate ticket for this case.
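The numbers in the reported error are consistent with a total wait equal to the per-fragment timeout multiplied by the number of fragments sent (5 × 5000 ms = 25000 ms). The Java sketch below illustrates that arithmetic, and also the point made in this comment: if one node never responds, a larger timeout only delays the failure. It assumes a latch-style wait; `FragmentTimeoutSketch`, `totalTimeoutMs`, and `awaitAcks` are hypothetical names, and this is not the actual FragmentsRunner code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FragmentTimeoutSketch {

    /** Total wait budget: the per-fragment timeout scaled by the number of fragments sent. */
    static long totalTimeoutMs(int fragmentsSent, long perFragmentTimeoutMs) {
        return fragmentsSent * perFragmentTimeoutMs;
    }

    /**
     * Waits until every fragment is acknowledged or the budget runs out.
     * Returns the number of acknowledgements still missing (0 means success).
     */
    static long awaitAcks(CountDownLatch acks, long timeoutMs) {
        try {
            acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        return acks.getCount();
    }

    public static void main(String[] args) {
        int sent = 5;

        // 5 fragments at the old hardcoded 5 s each matches the 25000 in the error message.
        System.out.println(totalTimeoutMs(sent, 5_000L)); // prints 25000

        // Simulate 4 of 5 nodes responding. The budget is kept short so the demo
        // finishes quickly; a longer one would give the same result, only later.
        CountDownLatch acks = new CountDownLatch(sent);
        for (int i = 0; i < 4; i++) {
            acks.countDown();
        }
        long missing = awaitAcks(acks, 50L);
        System.out.println("Sent " + sent + ", heard back from " + (sent - missing));
    }
}
```

Making the per-fragment value configurable helps with slow responses, but a response that is never sent needs the separate fix mentioned above.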