[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035988#comment-14035988 ] Karthik Kambatla commented on MAPREDUCE-5896: - Looks good to me. +1. Will commit this later today. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896-2.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027977#comment-14027977 ] Sandy Ryza commented on MAPREDUCE-5896: --- bq. Can we make InputSplitLocationInfo extend InputSplit? It doesn't make sense for any class to implement only InputSplitLocationInfo without implementing InputSplit. Will do. bq. Nothing to do with this patch. It is unfortunate that mapreduce.InputSplit doesn't implement mapred.InputSplit. Would it be easy to fix it? Not entirely sure the reasoning there, but as this stuff can have binary compatibility implications in mysterious ways, I'd rather not touch it if we don't need to. bq. The following two constants should probably be in SplitLocationInfo? They're only used in FileSplit and not in SplitLocationInfo - is there utility in moving them away from where they're used? I'd like to avoid adding these constants to the API because, when we include additional storage types, each SplitLocationInfo could end up as a union of storage types - needing to add a ON_DISK_AND_IN_FLASH_AND_IN_MEMORY would be ugly. bq. Instead of assigning ON_DISK by default, would it make sense to set it post null-check after the loop for checking if it is in memory. Any advantage to this? It would add extra code, an extra branch, and I don't think be particularly more readable. bq. Do you think it would make sense to include the string corresponding to the location in SplitLocationInfo itself? Will do. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028710#comment-14028710 ] Hadoop QA commented on MAPREDUCE-5896: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12649975/MAPREDUCE-5896-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4650//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4650//console This message is automatically generated. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896-2.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026541#comment-14026541 ] Tom White commented on MAPREDUCE-5896: -- +1 for the API changes. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027022#comment-14027022 ] Karthik Kambatla commented on MAPREDUCE-5896: - Review comments: # Can we make InputSplitLocationInfo extend InputSplit? It doesn't make sense for any class to implement only InputSplitLocationInfo without implementing InputSplit. # I am uncomfortable having to depend on index matching between InputSplit#getLocations and InputSplit#getLocationInfo. Do you think it would make sense to include the string corresponding to the location in SplitLocationInfo itself? We could deprecate InputSplit#getLocations(). Users are to be expected to use getLocationInfos instead. # Nothing to do with this patch. It is unfortunate that mapreduce.InputSplit doesn't implement mapred.InputSplit. Would it be easy to fix it? # Nit: The following two constants should probably be in SplitLocationInfo? {code} private static final SplitLocationInfo ON_DISK = new SplitLocationInfo(false); private static final SplitLocationInfo IN_MEMORY = new SplitLocationInfo(true); {code} # Nit: Instead of assigning ON_DISK by default, would it make sense to set it post null-check after the loop for checking if it is in memory. {code} for (int i = 0; i hosts.length; i++) { hostInfos[i] = ON_DISK; // because N will be tiny, scanning is probably faster than a HashSet for (String inMemoryHost : inMemoryHosts) { {code} Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011391#comment-14011391 ] Sandy Ryza commented on MAPREDUCE-5896: --- Updated patch sets all the new APIs to Evolving, fixes the typo that Tom noticed, includes cached hosts in mapred.FileInputFormat split generation, and adds tests. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011441#comment-14011441 ] Hadoop QA commented on MAPREDUCE-5896: -- {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12647178/MAPREDUCE-5896-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4630//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4630//console This message is automatically generated. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896-1.patch, MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009664#comment-14009664 ] Tom White commented on MAPREDUCE-5896: -- @Evolving would mean that it could change incompatibly between 2.4.0 and 2.5.0 say, but not between 2.4.0 and 2.4.1 - and that might be the right trade off here. Hopefully SplitLocationInfo would not need to be changed incompatibly, but it's hard to know for sure until it's being used. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008903#comment-14008903 ] Tom White commented on MAPREDUCE-5896: -- From a compatibility point of view I think the changes are OK, although SplitLocationInfo should not (yet) be marked as @Stable. Also, mapred.FileSplit has a new getSplitLocationInfo() method, but it should be getLocationInfo(). Add an @Override annotation to catch this. How would the split calculation in FileInputFormat change to use hosts with a cached copy of a particular block? It would be worth creating a trial implementation to test that the changes are sufficient. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14008925#comment-14008925 ] Hadoop QA commented on MAPREDUCE-5896: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646096/MAPREDUCE-5896.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4620//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4620//console This message is automatically generated. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009055#comment-14009055 ] Sandy Ryza commented on MAPREDUCE-5896: --- What should the criteria be for marking this stable? I'd like to start using this in downstream projects (Spark, and I believe Tez could benefit as well) as soon as possible, and an Evolving annotation would prevent this. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004424#comment-14004424 ] Sandy Ryza commented on MAPREDUCE-5896: --- Given HDFS's plans for hierarchical storage management, I think it would be good to make this extensible to handle storage mediums beyond memory. I talked this over with [~andrew.wang] and [~atm] and we think the right interface would be something like a SplitLocationInfo class, with isInMemory() and isOnDisk() methods. We can later add isInFlash() and possibly even getDisk() to return which disk the data is on. InputSplits would have a SplitLocationInfo[] getLocationInfo() method that returns info about how the data is stored on each host returned by getLocations(). Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAPREDUCE-5896) Allow InputSplits to indicate which locations have the block cached in memory
[ https://issues.apache.org/jira/browse/MAPREDUCE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005194#comment-14005194 ] Sandy Ryza commented on MAPREDUCE-5896: --- Uploaded a POC patch. I'll add some tests if others think the APIs make sense. Allow InputSplits to indicate which locations have the block cached in memory - Key: MAPREDUCE-5896 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5896 Project: Hadoop Map/Reduce Issue Type: Improvement Affects Versions: 2.4.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5896.patch -- This message was sent by Atlassian JIRA (v6.2#6252)