[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
[ https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v11.patch

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> ----------------------------------------------------------------
>
>                 Key: YARN-2410
>                 URL: https://issues.apache.org/jira/browse/YARN-2410
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: Nathan Roberts
>            Assignee: Kuhu Shukla
>         Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch,
> YARN-2410-v11.patch, YARN-2410-v2.patch, YARN-2410-v3.patch,
> YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch,
> YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> Job with 6K reduces, slow start set to 0.95, about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of this file. This will theoretically
> happen 6000*40=240000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
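The worst-case figure in the scenario above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are hypothetical, and it assumes every reducer's request for every map output on the node holds one open file at the same moment, which is the upper bound the description refers to.

```java
// Hypothetical helper, not part of Hadoop: worst-case count of simultaneously
// open map output files on one NodeManager for the scenario in the issue.
final class ShuffleFdEstimate {

    // One open file per (reducer, map output) pair in the worst case.
    static long worstCaseOpenFiles(long reducers, long mapOutputsPerNode) {
        return Math.multiplyExact(reducers, mapOutputsPerNode);
    }

    public static void main(String[] args) {
        // 6K reduces, about 40 map outputs per node (numbers from the issue).
        System.out.println(worstCaseOpenFiles(6_000, 40)); // prints 240000
    }
}
```

Any per-process file descriptor limit well below that product would be exhausted if all transfers were opened eagerly.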
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment:     (was: YARN-2410-v11.patch)

>         Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch,
> YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch,
> YARN-2410-v5.patch, YARN-2410-v6.patch, YARN-2410-v7.patch,
> YARN-2410-v8.patch, YARN-2410-v9.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v11.patch

Adding documentation comments.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch,
> YARN-2410-v11.patch, YARN-2410-v2.patch, YARN-2410-v3.patch,
> YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch,
> YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v10.patch

Fixing whitespace and checkstyle issues.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v10.patch,
> YARN-2410-v2.patch, YARN-2410-v3.patch, YARN-2410-v4.patch,
> YARN-2410-v5.patch, YARN-2410-v6.patch, YARN-2410-v7.patch,
> YARN-2410-v8.patch, YARN-2410-v9.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v9.patch

sendMap should have only reduceContext as an argument. Test refactored to have helper methods.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch,
> YARN-2410-v6.patch, YARN-2410-v7.patch, YARN-2410-v8.patch, YARN-2410-v9.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v8.patch

Corrected the test to respect the 80-character line limit.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch,
> YARN-2410-v6.patch, YARN-2410-v7.patch, YARN-2410-v8.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v7.patch

Modified ShuffleHandler to not use channel attachments. Moved MockNetty code to a helper method.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch,
> YARN-2410-v6.patch, YARN-2410-v7.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v6.patch

Thank you so much [~jlowe] for the detailed feedback. I have made all but two of the requested changes and would like your further comments on those two.
{quote}
Actually I'm not really sure why SendMapOutputParams exists separate from ReduceContext. There should be a one-to-one relationship there.
{quote}
I totally agree. The only reason was findbugs, which does not allow more than 7 parameters in a function call (or in the constructor that would populate these values). If this is not an issue, I can move them into a single class. For now I have made SendMapOutputParams an inner class of ReduceContext.
{quote}
Why was reduceContext added as a TestShuffleHandler instance variable? It's specific to the new test.
{quote}
The reduceContext variable holds the value set by the setAttachment() method and is used by the getAttachment() answer. If I declare it in the test method, it needs to be final, which cannot be done because it is assigned by the setter. I am looking for another way. Let me know what you think.
All other items have been done.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch
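One common way around the "local variable must be final" restriction mentioned in this comment is a final holder object whose contents can change. The sketch below is an assumption about how that could look, not what any of the patches did: the class name and the Consumer/Supplier stand-ins for the mocked setAttachment() and getAttachment() answers are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical illustration: keeping test state local to one test method.
final class AttachmentHolderSketch {

    static <T> T roundTrip(T attachment) {
        // The reference itself is final, so lambdas and anonymous classes may
        // capture it; the value it holds can still be replaced.
        final AtomicReference<T> holder = new AtomicReference<>();

        Consumer<T> setAttachment = holder::set; // stand-in for the mocked setter
        Supplier<T> getAttachment = holder::get; // stand-in for the mocked getter

        setAttachment.accept(attachment);
        return getAttachment.get();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("reduceContext")); // prints reduceContext
    }
}
```

With a holder like this the state would not need to be an instance variable of the test class.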
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Rohith Sharma K S updated YARN-2410:
------------------------------------
    Target Version/s: 2.7.2
       Fix Version/s:     (was: 2.7.2)

Updating the Target Version field to 2.7.2. The Fix Version is added when the issue is committed.

>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v5.patch

This is the latest revised patch. A messageReceived() call uses two counters, mapsToWait and mapsToSend, within the ReduceContext class to throttle the number of sendMapOutput calls. Because of the asynchronous nature of Netty, these counters are atomic. A revised test case that mocks Netty operations is also included. Every IO operation completed by sendMapOutput starts another, until the entire mapIds list for a given request is processed.

>             Fix For: 2.7.2
>
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch
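The throttling scheme described for the v5 patch can be sketched roughly as follows. This is a single-threaded simulation under stated assumptions, not the patch itself: only the names ReduceContext, mapsToWait, mapsToSend and mapIds come from the comment, while the class ReduceContextSketch, the queue of simulated completions, and the maxInFlight parameter are hypothetical. The counters are kept atomic to mirror the comment, although this simulation does not strictly need that.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the per-request ReduceContext; not the patch's code.
final class ReduceContextSketch {
    private final List<String> mapIds;
    private final AtomicInteger mapsToSend = new AtomicInteger(0); // index of next map output to start
    private final AtomicInteger mapsToWait;                        // transfers not yet completed
    private final Queue<Runnable> pendingIo = new ArrayDeque<>();  // simulated async completions
    private int openNow = 0;
    private int peakOpen = 0;
    private int completed = 0;

    ReduceContextSketch(List<String> mapIds) {
        this.mapIds = mapIds;
        this.mapsToWait = new AtomicInteger(mapIds.size());
    }

    // Starts the next transfer if any map output is left for this request.
    private void sendNext() {
        int next = mapsToSend.getAndIncrement();
        if (next >= mapIds.size()) {
            return;
        }
        openNow++;                              // the file would be opened here
        peakOpen = Math.max(peakOpen, openNow);
        pendingIo.add(() -> {                   // completion callback of the transfer
            openNow--;                          // the file would be closed here
            completed++;
            if (mapsToWait.decrementAndGet() > 0) {
                sendNext();                     // each completed transfer starts another
            }
        });
    }

    // Returns {peak number of simultaneously open files, completed transfers}.
    static int[] run(int mapOutputs, int maxInFlight) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < mapOutputs; i++) {
            ids.add("map_" + i);
        }
        ReduceContextSketch ctx = new ReduceContextSketch(ids);
        for (int i = 0; i < Math.min(maxInFlight, mapOutputs); i++) {
            ctx.sendNext();                     // initial batch
        }
        Runnable done;
        while ((done = ctx.pendingIo.poll()) != null) {
            done.run();                         // deliver completions in order
        }
        return new int[] {ctx.peakOpen, ctx.completed};
    }

    public static void main(String[] args) {
        int[] r = run(40, 3);
        System.out.println("peak open = " + r[0] + ", completed = " + r[1]);
    }
}
```

With 40 map outputs and an initial batch of 3, the simulation never holds more than 3 files open, yet all 40 transfers complete.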
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v4.patch

Revamped patch that uses a Map to store the number of open files per reduceId and passes the updated open-file count through the channel as an attachment. The number of files that can be open per reducer is configurable.

>             Fix For: 2.7.2
>
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch, YARN-2410-v4.patch
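A per-reducer open-file count of the kind described for the v4 patch might be kept as in the sketch below. It is an assumption-laden illustration: the class OpenFileLimiterSketch and its methods are hypothetical, the limit is passed in directly rather than read from configuration, and the channel attachment part of the patch is not modelled.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: tracks open files per reduceId against a fixed limit.
final class OpenFileLimiterSketch {
    private final int maxOpenPerReducer;
    private final ConcurrentMap<Integer, Integer> openFiles = new ConcurrentHashMap<>();

    OpenFileLimiterSketch(int maxOpenPerReducer) {
        this.maxOpenPerReducer = maxOpenPerReducer;
    }

    // Returns true and bumps the count if the reducer is below its limit.
    boolean tryAcquire(int reduceId) {
        boolean[] granted = {false};
        // compute() runs atomically for the key, so check and increment cannot interleave.
        openFiles.compute(reduceId, (id, count) -> {
            int current = (count == null) ? 0 : count;
            if (current >= maxOpenPerReducer) {
                return count;               // at the limit: leave the entry unchanged
            }
            granted[0] = true;
            return current + 1;
        });
        return granted[0];
    }

    // Call when a transfer finishes and its file is closed.
    void release(int reduceId) {
        openFiles.computeIfPresent(reduceId, (id, count) -> count > 1 ? count - 1 : null);
    }

    int open(int reduceId) {
        return openFiles.getOrDefault(reduceId, 0);
    }

    public static void main(String[] args) {
        OpenFileLimiterSketch limiter = new OpenFileLimiterSketch(2);
        System.out.println(limiter.tryAcquire(7)); // true
        System.out.println(limiter.tryAcquire(7)); // true
        System.out.println(limiter.tryAcquire(7)); // false, limit of 2 reached
    }
}
```

Removing the entry when the count drops to zero keeps the map from growing with every reduceId ever seen.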
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v3.patch

>             Fix For: 2.7.2
>
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch,
> YARN-2410-v3.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v2.patch

Patch generated without no-prefix, since git apply works without it.

>             Fix For: 2.7.2
>
>         Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Kuhu Shukla updated YARN-2410:
------------------------------
    Attachment: YARN-2410-v1.patch

The ShuffleHandler messageReceived() method calls sendMapOutput() only if the number of open files for a given reduceId is within a configurable limit (mapreduce.shuffle.map.filecount). The count is incremented on each call to sendMapOutput(). The channel is closed once this limit is reached.

>         Attachments: YARN-2410-v1.patch
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Chen He updated YARN-2410:
--------------------------
    Assignee: Kuhu Shukla  (was: Chen He)
[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors
Junping Du updated YARN-2410:
-----------------------------
    Summary: Nodemanager ShuffleHandler can possible exhaust file descriptors  (was: Nodemanager ShuffleHandler can easily exhaust file descriptors)

>            Assignee: Chen He