[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Open) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
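A minimal sketch of what the default behavior described in this issue might look like; the use of the "user.name" system property and of Path#makeQualified() is an assumption for illustration, not the committed patch, and the two method bodies below belong to FileSystem and RawLocalFileSystem respectively.

{code}
// Sketch only -- not the committed HADOOP-2567 patch.

// Default FileSystem implementation: home is "/user/$USER".
public Path getHomeDirectory() {
  return new Path("/user/" + System.getProperty("user.name"))
    .makeQualified(this);
}

// RawLocalFileSystem override: use the platform home directory.
public Path getHomeDirectory() {
  return new Path(System.getProperty("user.home")).makeQualified(this);
}
{code}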
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Open (was: Patch Available) HADOOP-2646 has been added to address the SortValidator issue. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-tests.patch Adding tests of new getHomeDirectory() method. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560120#action_12560120 ] Doug Cutting commented on HADOOP-2638: -- > Seems clumsier than just closing idle connections [ ... ] It also gives you explicit control. If you do need to iterate over a range of keys, then you can wait to release the connection until you've completed the iteration, while a pread-based approach would have to open a new connection per buffer refill or somesuch. As for simplicity, background threads that time stuff out are hairy and easy to get subtly wrong. Folks also don't generally like more background threads running in the client's JVM, since clients should be lean-and-mean. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. > > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2646) SortValidator broken with fully-qualified working directories
[ https://issues.apache.org/jira/browse/HADOOP-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2646: - Attachment: HADOOP-2646.patch This patch is known to fix SortValidator on single-node clusters, but may not work on multi-node clusters. See HADOOP-2567 for details. > SortValidator broken with fully-qualified working directories > - > > Key: HADOOP-2646 > URL: https://issues.apache.org/jira/browse/HADOOP-2646 > Project: Hadoop > Issue Type: Bug > Components: test >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2646.patch > > > The sort validator is broken by HADOOP-2567. In particular, it no longer > works when DistributedFileSystem#getWorkingDirectory() returns a > fully-qualified path. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2646) SortValidator broken with fully-qualified working directories
SortValidator broken with fully-qualified working directories - Key: HADOOP-2646 URL: https://issues.apache.org/jira/browse/HADOOP-2646 Project: Hadoop Issue Type: Bug Components: test Reporter: Doug Cutting Fix For: 0.16.0 The sort validator is broken by HADOOP-2567. In particular, it no longer works when DistributedFileSystem#getWorkingDirectory() returns a fully-qualified path. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560106#action_12560106 ] Doug Cutting commented on HADOOP-2567: -- I am unable to reproduce this failure. The single-machine instructions you gave above generate four input files and one output file. I modified the sort command line so that four output files are used, since the code in question involves determining whether a given input to the validator is a sort input or output, but that still validated correctly. Perhaps Arun, who originally wrote the validator, could have a look at this? > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2634) Deprecate exists() and isDir() to simplify ClientProtocol.
[ https://issues.apache.org/jira/browse/HADOOP-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560101#action_12560101 ] Doug Cutting commented on HADOOP-2634: -- Hairong has addressed these inconsistencies in HADOOP-2566. Yes, the implementation of exists() in terms of getFileStatus() would be simple. However it is considered bad style to use exceptions for normal control flow, and exists() returning false is a normal condition. We might just have to live with that... > Deprecate exists() and isDir() to simplify ClientProtocol. > -- > > Key: HADOOP-2634 > URL: https://issues.apache.org/jira/browse/HADOOP-2634 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.15.0 >Reporter: Konstantin Shvachko > > ClientProtocol can be simplified by removing two methods > {code} > public boolean exists(String src) throws IOException; > public boolean isDir(String src) throws IOException; > {code} > This is a redundant api, which can be implemented in DFSClient as convenience > methods using > {code} > public DFSFileInfo getFileInfo(String src) throws IOException; > {code} > Note that we already deprecated several Filesystem method and advised to use > getFileStatus() instead. > Should we deprecate them in 0.16? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560090#action_12560090 ] Doug Cutting commented on HADOOP-2638: -- > I'd be interested to hear how the aforementioned 'pread' would be better than > whats going on underneath the MapFile.get. In HDFS, each call to pread opens a new connection to a datanode, reads the requested data, then closes the connection. If the requested data spans multiple blocks it will open connections for each block as required, it will re-try on network errors, etc. But, bottom-line, no connection is left open. When you initially open an HDFS file it does not open a connection to any datanodes: only the namenode is consulted on open. Once you call read(byte[]), a datanode connection is generally held open. But, if one only ever uses pread, then no connection is held open. Another approach to fixing this would be to add an FSInputStream method to close the connection to the datanode, perhaps called release(). The stream would still be open and at the same position, but some attached resources may be released. The default implementation would do nothing, but for HDFS it would close any open datanode connection. Then we could add a SequenceFile#release(), and similarly for MapFile. Then, after a call to MapFile#get() you could explicitly release the underlying connection. That might be the simplest fix to implement. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. 
> > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
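To make the proposed release() call concrete, here is a hedged sketch of how an application like HBase might use it after a random read. MapFile.Reader#get() and close() are existing methods; release() is only the addition suggested in the comment above, so it is shown commented out, and the Text value type is assumed for the example.

{code}
import java.io.IOException;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class RandomReadSketch {
  // Perform one random read from an already-open reader, then drop any
  // idle datanode connection while keeping the reader (and its in-memory
  // index) open for later calls.
  public static Writable readOne(MapFile.Reader reader, WritableComparable key)
      throws IOException {
    Writable val = reader.get(key, new Text());
    // Proposed, not-yet-existing API:
    // reader.release();  // close idle datanode connection, keep reader open
    return val;
  }
}
{code}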
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560086#action_12560086 ] Doug Cutting commented on HADOOP-2566: -- Do listStatus(Path[]) and globStatus(Path[]) need to be public? Does anyone use these but the globbing code? I generally prefer not to make something public without a strong need. Other than that, this looks good to me. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch, > globStatus3.patch, globStatus4.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560061#action_12560061 ] Doug Cutting commented on HADOOP-2638: -- Are you suggesting that MapFile#Reader change to use read(pos, buf, off, len), aka pread, exclusively? That's an interesting idea. We could implement this by adding an option to SequenceFile#Reader to always use pread. MapFile would not use this option for its index file, which is always read in its entirety, but only for its data file. It would mean that, should one seek to a key and then do sequential access, that each buffer refill would require a new connection, which would not be optimal. But that could be optimized: a buffer refill triggered by next() could switch the underlying data file to non-pread mode, while the next seek() might convert it back to pread mode. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. > > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). 
This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
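For reference, the 'pread' under discussion is the positional-read call shown below; per the description earlier in this thread, HDFS opens a datanode connection, reads, and closes it for each such call, so nothing is left open between calls. A small hedged illustration:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
  // Read buf.length bytes at an arbitrary offset using the positional
  // read API; unlike seek() followed by read(byte[]), this does not
  // leave a datanode connection held open by the stream.
  public static int readAt(FileSystem fs, Path file, long pos, byte[] buf)
      throws IOException {
    FSDataInputStream in = fs.open(file);
    try {
      return in.read(pos, buf, 0, buf.length);
    } finally {
      in.close();
    }
  }
}
{code}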
[jira] Commented: (HADOOP-2634) Deprecate exists() and isDir() to simplify ClientProtocol.
[ https://issues.apache.org/jira/browse/HADOOP-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560052#action_12560052 ] Doug Cutting commented on HADOOP-2634: -- +1 for removing those protocol methods. FileSystem#exists() should probably be made a concrete method in FileSystem.java, defined in terms of getFileStatus(), most existing implementations can probably be removed, and it could probably be deprecated. BTW, what is getFileStatus() supposed to do when a file does not exist? Throw an IOException or return null? The former is generally preferable, but the latter makes implementing exists() easier, since we should not use exception handling for normal program flow. I don't see a need to do this the day before 0.16 feature freeze, and it could be destabilizing. > Deprecate exists() and isDir() to simplify ClientProtocol. > -- > > Key: HADOOP-2634 > URL: https://issues.apache.org/jira/browse/HADOOP-2634 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.15.0 >Reporter: Konstantin Shvachko > > ClientProtocol can be simplified by removing two methods > {code} > public boolean exists(String src) throws IOException; > public boolean isDir(String src) throws IOException; > {code} > This is a redundant api, which can be implemented in DFSClient as convenience > methods using > {code} > public DFSFileInfo getFileInfo(String src) throws IOException; > {code} > Note that we already deprecated several Filesystem method and advised to use > getFileStatus() instead. > Should we deprecate them in 0.16? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
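A minimal sketch of exists() defined in terms of getFileStatus(), assuming the missing-file case surfaces as a FileNotFoundException; this is exactly the exception-for-normal-control-flow pattern the comment is weighing, not a committed implementation, and it is written as a static helper rather than a FileSystem method for self-containment.

{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExistsSketch {
  // exists() as a convenience over getFileStatus(): a missing path is
  // assumed to raise FileNotFoundException rather than return null.
  public static boolean exists(FileSystem fs, Path f) throws IOException {
    try {
      fs.getFileStatus(f);
      return true;
    } catch (FileNotFoundException e) {
      return false;
    }
  }
}
{code}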
[jira] Commented: (HADOOP-2626) RawLocalFileStatus is badly handling URIs
[ https://issues.apache.org/jira/browse/HADOOP-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560043#action_12560043 ] Doug Cutting commented on HADOOP-2626: -- > What about this patch then ? That looks better to me, in that the returned Path is now fully qualified. Does it handle escapes any better than before? If not, 'new Path(file.toUri().getPath()).makeQualified(fs)' may do better. As Nigel indicates, some test cases would be very useful. > RawLocalFileStatus is badly handling URIs > - > > Key: HADOOP-2626 > URL: https://issues.apache.org/jira/browse/HADOOP-2626 > Project: Hadoop > Issue Type: Bug > Components: fs >Affects Versions: 0.15.2 >Reporter: Frédéric Bertin > Attachments: HADOOP-2626.patch > > > as a result, files with special characters (that get encoded when translated > to URIs) are badly handled using a local filesystem. > {{new Path(f.toURI().toString()))}} should be replaced by {{new > Path(f.toURI().getPath()))}} > IMHO, each call to {{toURI().toString()}} should be considered suspicious. > There's another one in the class CopyFiles at line 641. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
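A hedged sketch of the conversion being suggested, written as the hypothetical fileToPath helper mentioned in this thread (it is not an existing method): decode the file: URI so escaped characters come back out, then qualify the result against the filesystem.

{code}
import java.io.File;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileToPathSketch {
  // Hypothetical helper: java.io.File -> fully-qualified Path.
  // Going through toURI().getPath() decodes percent-escapes, and
  // makeQualified() pins the result to the given FileSystem's scheme.
  public static Path fileToPath(File file, FileSystem fs) {
    return new Path(file.toURI().getPath()).makeQualified(fs);
  }
}
{code}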
[jira] Commented: (HADOOP-2421) Release JDiff report of changes between different versions of Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560038#action_12560038 ] Doug Cutting commented on HADOOP-2421: -- > where do i get the "OLD" one ? The approach I suggested in the Lucene issue was to have some ant properties that determine the subversion tag url of the prior version. In trunk this would point to the prior release. We'd update it in trunk after each release is made. Then the ant build script would check this out in build/ if it didn't already exist there. We could (and should) permit this to be optimized, perhaps by permitting folks to override a property so that the prior version is stored somewhere more permanent than build/, and perhaps use 'svn switch; svn update' to make sure that the cached prior version contains what we expect. > Does the user need to download and then pass the path to ant with -D option ? I'd imagined that specifying -Djdiff.prior.dir would be optional, but would help performance a lot, but we could make it mandatory, and emit an error if it's not specified. That might reduce the load on subversion somewhat. > Release JDiff report of changes between different versions of Hadoop > > > Key: HADOOP-2421 > URL: https://issues.apache.org/jira/browse/HADOOP-2421 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Nigel Daley >Priority: Minor > > Similar to LUCENE-1083, it would be useful to report javadoc differences (ala > [JDiff|http://www.jdiff.org/]) between Hadoop releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2608) Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
[ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560033#action_12560033 ] Doug Cutting commented on HADOOP-2608: -- We might also look to see whether org.apache.hadoop.record.Utils.fromBinaryString could be made any faster. What happens if this just does 'new String(bytes, "UTF-8")'? Is the problem our homegrown UTF-8 decoder, or UTF-8 decoding in general? It'd be nice to return org.apache.hadoop.io.Text instead, since that permits many string operations w/o decoding UTF-8, but that'd be a bigger change. > Reading sequence file consumes 100% cpu with maximum throughput being about > 5MB/sec per process > --- > > Key: HADOOP-2608 > URL: https://issues.apache.org/jira/browse/HADOOP-2608 > Project: Hadoop > Issue Type: Improvement > Components: io >Reporter: Runping Qi > > I did some tests on the throughput of scanning block-compressed sequence > files. > The sustained throughput was bounded at 5MB/sec per process, with the cpu of > each process maxed at 100%. > It seems to me that the cpu consumption is too high and the throughput is too > low for just scanning files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
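A hedged illustration of the comparison being asked about: decoding the bytes eagerly with the stock JDK decoder versus wrapping them in a Text, which keeps the raw UTF-8 bytes and defers character decoding until it is actually needed.

{code}
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

public class DecodeSketch {
  // Eager decode with the JDK's built-in UTF-8 decoder.
  public static String jdkDecode(byte[] utf8) throws UnsupportedEncodingException {
    return new String(utf8, "UTF-8");
  }

  // Wrap the bytes in a Text; no char decoding happens until needed.
  public static Text wrap(byte[] utf8) {
    Text t = new Text();
    t.set(utf8, 0, utf8.length);
    return t;
  }
}
{code}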
[jira] Updated: (HADOOP-2626) RawLocalFileStatus is badly handling URIs
[ https://issues.apache.org/jira/browse/HADOOP-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2626: - Status: Open (was: Patch Available) This patch puts an unqualified path in the returned FileStatus. That's not strictly a bug, but we've found that it's always safest to return fully-qualified paths whenever we can. To convert the java.io.File to a Path, we might use new Path(file.getPath()).makeQualified(fs). Perhaps this should be added as a fileToPath method, since there's already a path2File() method. > RawLocalFileStatus is badly handling URIs > - > > Key: HADOOP-2626 > URL: https://issues.apache.org/jira/browse/HADOOP-2626 > Project: Hadoop > Issue Type: Bug > Components: fs >Affects Versions: 0.15.2 >Reporter: Frédéric Bertin > Attachments: patch-Hadoop-2626.diff > > > as a result, files with special characters (that get encoded when translated > to URIs) are badly handled using a local filesystem. > {{new Path(f.toURI().toString()))}} should be replaced by {{new > Path(f.toURI().getPath()))}} > IMHO, each call to {{toURI().toString()}} should be considered suspicious. > There's another one in the class CopyFiles at line 641. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559723#action_12559723 ] Doug Cutting commented on HADOOP-2566: -- > Is this what we wanted? I thought we wanted other way around. I don't think it does that in all cases, but it does still appear to call getStatus() in places. I've not yet examined the logic to see if that's easily avoidable or not. But it's not a fatal problem at this point. For this release the important thing is to have globStatus() as the preferred, non-deprecated method. Once we remove the status cache, during 0.17 development, we'll soon find out whether the globStatus() implementation needs more work to perform well without a cache, and fix that before 0.17 is released. But that aspect shouldn't block this for 0.16, since we still have the cache in 0.16. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch, globStatus1.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2604) [hbase] Create an HBase-specific MapFile implementation
[ https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559676#action_12559676 ] Doug Cutting commented on HADOOP-2604: -- > it'd be nice to iterate on the keys of a MapFile without actually reading the > data SequenceFile supports that, so it shouldn't be too hard to add a next(WritableComparable) method to the MapFile API, right? > [hbase] Create an HBase-specific MapFile implementation > --- > > Key: HADOOP-2604 > URL: https://issues.apache.org/jira/browse/HADOOP-2604 > Project: Hadoop > Issue Type: Improvement > Components: contrib/hbase >Reporter: Bryan Duxbury >Priority: Minor > > Today, HBase uses the Hadoop MapFile class to store data persistently to > disk. This is convenient, as it's already done (and maintained by other > people :). However, it's beginning to look like there might be possible > performance benefits to be had from doing an HBase-specific implementation of > MapFile that incorporated some precise features. > This issue should serve as a place to track discussion about what features > might be included in such an implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
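The SequenceFile support referred to here is the key-only next() call; a MapFile.Reader#next(WritableComparable) would presumably wrap the same call on the data file. A hedged sketch, assuming Text keys for the example:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class KeyScanSketch {
  // Iterate over keys only; next(key) advances past each value without
  // deserializing it.
  public static void printKeys(FileSystem fs, Path file, Configuration conf)
      throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      Text key = new Text();
      while (reader.next(key)) {
        System.out.println(key);
      }
    } finally {
      reader.close();
    }
  }
}
{code}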
[jira] Commented: (HADOOP-2604) [hbase] Create an HBase-specific MapFile implementation
[ https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559655#action_12559655 ] Doug Cutting commented on HADOOP-2604: -- > Exclude column family name from the file [ ... ] The column family name could be stored in the SequenceFile's metadata, no? MapFile's constructors don't currently permit one to specify metadata, but that'd be easy to add. > There is some indication that the existing MapFile implementation is > optimized for streaming access [ ... ] It shouldn't be. The problem is that mapreduce, which is primarily used to benchmark and debug Hadoop, doesn't do any random access. So it's easy for random-access-related performance problems to sneak into MapFile and HDFS. Both Nutch and HBase depend on efficient random access from Hadoop, primarily through MapFile. We need a good random-access benchmark that someone regularly executes. Perhaps one could be added to the sort benchmark suite, since that is regularly run by Yahoo!? Or someone else could start running regular HBase benchmarks on a grid somewhere? > [hbase] Create an HBase-specific MapFile implementation > --- > > Key: HADOOP-2604 > URL: https://issues.apache.org/jira/browse/HADOOP-2604 > Project: Hadoop > Issue Type: Improvement > Components: contrib/hbase >Reporter: Bryan Duxbury >Priority: Minor > > Today, HBase uses the Hadoop MapFile class to store data persistently to > disk. This is convenient, as it's already done (and maintained by other > people :). However, it's beginning to look like there might be possible > performance benefits to be had from doing an HBase-specific implementation of > MapFile that incorporated some precise features. > This issue should serve as a place to track discussion about what features > might be included in such an implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
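A hedged sketch of the metadata idea: stash the column family name in SequenceFile metadata at write time and read it back without scanning data. The "hbase.column.family" key is made up for illustration, and the createWriter overload that accepts a Metadata argument is assumed; MapFile itself would still need a constructor that forwards the Metadata object.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class FamilyMetadataSketch {
  // Record the column family name as file-level metadata.
  public static void writeWithFamily(FileSystem fs, Configuration conf,
                                     Path file, String family) throws IOException {
    SequenceFile.Metadata meta = new SequenceFile.Metadata();
    meta.set(new Text("hbase.column.family"), new Text(family)); // illustrative key
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, file,
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK,
        new DefaultCodec(), null, meta);
    writer.close();
  }

  // Recover the column family name without reading any records.
  public static Text readFamily(FileSystem fs, Configuration conf, Path file)
      throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      return reader.getMetadata().get(new Text("hbase.column.family"));
    } finally {
      reader.close();
    }
  }
}
{code}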
[jira] Commented: (HADOOP-2421) Release JDiff report of changes between different versions of Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559629#action_12559629 ] Doug Cutting commented on HADOOP-2421: -- > Is there a URI where the old documentation is available ? There are public URIs where released javadocs may be obtained, but I don't think JDiff uses the normal javadoc, but rather special javadoc output that it generates. Please see LUCENE-1083 which addresses this further. > Release JDiff report of changes between different versions of Hadoop > > > Key: HADOOP-2421 > URL: https://issues.apache.org/jira/browse/HADOOP-2421 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Nigel Daley >Priority: Minor > > Similar to LUCENE-1083, it would be useful to report javadoc differences (ala > [JDiff|http://www.jdiff.org/]) between Hadoop releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559628#action_12559628 ] Doug Cutting commented on HADOOP-2567: -- > Seems that the tests are still failing. Earlier I tried on a single machine > and it worked. Did you restart the cluster running the patched code? That may be required. Did it fail with the same error? > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559299#action_12559299 ] Doug Cutting commented on HADOOP-2566: -- A few comments: - should stat2paths be a public method on FileSystem? I'd prefer it were either private or perhaps on FileUtil. - globPaths() isn't deprecated. Do we think we'll keep this, or should it be deprecated? It is handy in some cases, but, on the other hand, we'd like to force folks to examine their uses of it, since in most cases performance will become abysmal once the FileStatus cache is removed, and we don't want to surprise folks with that. Thoughts? > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Reopened) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-sortvalidate.patch The attached patch fixes the sort validator. Amar, can you please confirm that this fixes things for you? Thanks! > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559284#action_12559284 ] Doug Cutting commented on HADOOP-2566: -- > Should a user of globStatus() be able to distinguish between a non-existent > path and a glob that does not match any files? I'm not sure I completely understand the distinction. In one case are you passing a path without any meta characters but that does not exist, and in the other one with metacharacters but that matches no files? In any case it should probably handle this the same way globPaths() does. If the distinction is important then perhaps the non-existing file case should return null, while the non-matching expression case should return an empty array. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
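A hedged illustration of the convention proposed here, using the globStatus() method under discussion; this is a suggestion from the comment, not necessarily what the final patch implements: null for a literal path that does not exist, an empty array for a pattern that matches nothing.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobCheckSketch {
  public static String describe(FileSystem fs, Path pattern) throws IOException {
    FileStatus[] matches = fs.globStatus(pattern);   // method under discussion
    if (matches == null) {
      return pattern + " does not exist";            // literal path, no file
    } else if (matches.length == 0) {
      return pattern + " matched no files";          // pattern matched nothing
    }
    return pattern + " matched " + matches.length + " file(s)";
  }
}
{code}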
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559216#action_12559216 ] Doug Cutting commented on HADOOP-2514: -- > I just committed this. Oops. I should have had someone review this first. Could someone please review this now? Should I revert it until someone does? > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559165#action_12559165 ] Doug Cutting commented on HADOOP-2543: -- > setting x77 means that there is a potential window where missed files can be > co-opted by someone who shouldn't have them. Like all files are today? I don't follow. We currently have zero security. The security we're adding in this release is easy to subvert and mostly to keep folks from shooting themselves in the foot. Keeping the "window" that's wide open today open a bit longer doesn't significantly compromise anything. > all those requiring backwards compatibility should just keep perms turned off. We want folks to be able to upgrade, then use new features, without jumping through hoops. Hoops should be optional. If you wish to be able to configure a non-777 permission for after upgrade, that would be a reasonable feature, but 777 should be the default. So perhaps we need a dfs.initial.permission parameter, used by the upgrade, whose default value is 777, but that you can override along with setting dfs.permissions=false, to support the upgrade procedure you desire. But I don't think we should force all installations through that procedure in order to get a usable system. We know from experience that most folks just install the new version and expect things to work out of the box. When they don't they file bugs. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned off. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2431) Test HDFS File Permissions
[ https://issues.apache.org/jira/browse/HADOOP-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559146#action_12559146 ] Doug Cutting commented on HADOOP-2431: -- Perhaps if the exception named in the RemoteException is a class that's loaded on the client and is permitted by the method signature, then RPC should automatically try to construct an instance and throw it. But that's not what RPC does today. If you feel it should do this, please file a separate issue. The FileSystem API promises that applications which attempt to violate permissions will be thrown an AccessControlException. Today, until RPC is changed, we must intercept RemoteException and explicitly throw an AccessControlException. The fact that a particular FileSystem is implemented using RPC should be invisible to clients. > Test HDFS File Permissions > -- > > Key: HADOOP-2431 > URL: https://issues.apache.org/jira/browse/HADOOP-2431 > Project: Hadoop > Issue Type: Test > Components: test >Affects Versions: 0.15.1 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: HDFSPermissionSpecification6.pdf, > PermissionsTestPlan1.pdf, testDFSPermission.patch, testDFSPermission1.patch > > > This jira is intended to provide junit tests to HADOOP-1298. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
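A hedged sketch of the interception described above: client-side code catches the RemoteException and re-throws a real AccessControlException when that is the class the server named; the actual DFSClient change may look different.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.permission.AccessControlException;
import org.apache.hadoop.ipc.RemoteException;

public class UnwrapSketch {
  // If the server-side class named in the RemoteException is
  // AccessControlException, surface that exception directly to the caller.
  public static IOException unwrap(RemoteException re) {
    if (AccessControlException.class.getName().equals(re.getClassName())) {
      return new AccessControlException(re.getMessage());
    }
    return re;
  }
}
{code}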
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559140#action_12559140 ] Doug Cutting commented on HADOOP-2385: -- > I would prefer creating new classes solely dedicated to configuration logic [ > ... ] I think this varies, case-by-case. For a complex subsystem like HDFS, it may make sense to have dedicated configuration classes. For a standalone class, like an InputFormat or compression codec, it probably makes sense to put configuration accessors directly on the class in question. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558879#action_12558879 ] Doug Cutting commented on HADOOP-2543: -- > Explicitly tightening them is more backwards compatible, but from the > security point of view, explicitly loosening them is safer. Yes, and for this upgrade, back-compatibility is more important than immediately increasing security. We don't decrease security any, and folks can easily increase security after the upgrade by tightening permissions. But we don't want things to be broken as soon as they upgrade by automatically tightening permissions. What I'm proposing is essentially the use-case you describe above for using dfs.permission=false, but without setting that: after the upgrade everything is permitted, and folks can start restricting access, but without having to restart the cluster. I think for most sites this is simpler, less surprising and sufficient. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned off. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2431) Test HDFS File Permissions
[ https://issues.apache.org/jira/browse/HADOOP-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558874#action_12558874 ] Doug Cutting commented on HADOOP-2431: -- This tests that permission check failures throw a RemoteException wrapping an AccessControlException. Shouldn't permission check failures throw an AccessControlException directly? DistributedFileSystem or DFSClient should catch the RemoteException and, when it wraps an AccessControlException, throw one of those so that client code sees that, no? Should I file a separate issue for this? > Test HDFS File Permissions > -- > > Key: HADOOP-2431 > URL: https://issues.apache.org/jira/browse/HADOOP-2431 > Project: Hadoop > Issue Type: Test > Components: test >Affects Versions: 0.15.1 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: HDFSPermissionSpecification6.pdf, > PermissionsTestPlan1.pdf, testDFSPermission.patch, testDFSPermission1.patch > > > This jira is intended to provide junit tests to HADOOP-1298. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Status: Patch Available (was: Open) > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Attachment: HADOOP-2514.patch Here's a patch that implements Sanjay's option 2. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558782#action_12558782 ] Doug Cutting commented on HADOOP-2567: -- > Currently the trunk does not pass the sort validation tests. Can you please attach details, like a log or stack trace? Or at least instructions on how to reproduce this. Thanks! Also, it might be good to run a scaled-down version of the sort benchmark & validation during unit testing, so that we exercise those codepaths and find things like this sooner. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558770#action_12558770 ] Doug Cutting commented on HADOOP-2543: -- > 1) all files and directories will be owned by the super user and super group; That seems fine. > 2) the permission of the files is set to be 0600 and the permission of the > directories is set to be 0700. The use of dfs.permissions=false should be optional, no? Folks should be able to upgrade and use the filesystem as before, but this would break that. The default protection after upgrade should continue to be 777, and folks should need to explicitly tighten permissions rather than explicitly loosen them. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned on. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558745#action_12558745 ] Doug Cutting commented on HADOOP-2543: -- How does this differ from the way that dfs.permissions=false works already? Is this a documentation issue, or are there functional changes required? > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned on. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558742#action_12558742 ] Doug Cutting commented on HADOOP-2566: -- > For example, globPath("/user/*/data") needs only to listPath("/user"). But listPaths() is not a primitive, it is a utility method defined in terms of listStatus(). So this example is calling listStatus("/user") and then stripping the list of FileStatus objects down to a list of Path objects. We should remove that stripping, or at least make it optional. To make it optional, the primitive glob operation should be globStatus, and globPaths() should become a utility method defined in terms of globStatus(). > Some of shell commands like delete, copy, and rename use globPath but don't > need FileStatus. These actually all do need the FileStatus. They need to find out whether each file is a directory or not, to find out when to recurse. Copy also needs other attributes so that they can be set on the copy too. So we'll end up needing to rework these. We will not remove globPaths() in this release, so these commands do not need to change right now. But before we can remove the cache we need to examine every place that calls globPaths to check whether these must be converted to use globStatus. That's why we're deprecating globPaths(), to force folks to do this. Then, in 0.17, we can remove the cache from trunk, and start identifying all the problems. But we want users who upgrade to 0.17 to be forwarned, and to have an API that supports cache-free use before we remove the cache, so that they can upgrade to 0.17 more smoothly. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
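A minimal sketch of the layering proposed in the comment above, assuming a FileSystem#globStatus(Path) primitive exists; this is illustrative only, not the committed patch:
{code}
// Inside FileSystem (hypothetical): keep globPaths() only as a deprecated
// utility defined in terms of the globStatus() primitive.
/** @deprecated use globStatus(Path) instead */
public Path[] globPaths(Path pattern) throws IOException {
  FileStatus[] matches = globStatus(pattern);   // primitive: returns status
  Path[] result = new Path[matches.length];
  for (int i = 0; i < matches.length; i++) {
    result[i] = matches[i].getPath();           // strip the status back down to paths
  }
  return result;
}
{code}
Callers that really only need paths lose nothing, while callers that need status stop paying for a per-file getFileStatus() once the cache is gone.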
[jira] Commented: (HADOOP-2346) DataNode should have timeout on socket writes.
[ https://issues.apache.org/jira/browse/HADOOP-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558709#action_12558709 ] Doug Cutting commented on HADOOP-2346: -- This looks nice! How well does it work? SocketInputStream and SocketOutputStream seem like fine names, but should they be nested classes in IOUtils, or perhaps independent classes in the 'net' package? Also, we might make the error messages in the exceptions a bit more informative, e.g., including the address the socket is connected to, the timeout, etc. > DataNode should have timeout on socket writes. > -- > > Key: HADOOP-2346 > URL: https://issues.apache.org/jira/browse/HADOOP-2346 > Project: Hadoop > Issue Type: Bug > Components: dfs >Affects Versions: 0.15.1 >Reporter: Raghu Angadi >Assignee: Raghu Angadi > Attachments: HADOOP-2346.patch > > > If a client opens a file and stops reading in the middle, DataNode thread > writing the data could be stuck forever. For DataNode sockets we set read > timeout but not write timeout. I think we should add a write(data, timeout) > method in IOUtils that assumes it the underlying FileChannel is non-blocking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
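For readers unfamiliar with the approach under review, here is a rough, hypothetical sketch of a timed socket write built on a non-blocking channel plus a selector. It only illustrates the idea (including putting the remote address and the timeout into the error message, as suggested above) and is not the attached patch:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public final class TimedWrites {
  /** Write buf to a channel that is already configured as non-blocking,
   *  failing if no progress can be made within timeoutMs. */
  public static void write(SocketChannel channel, ByteBuffer buf, long timeoutMs)
      throws IOException {
    Selector selector = Selector.open();
    try {
      channel.register(selector, SelectionKey.OP_WRITE);
      while (buf.hasRemaining()) {
        if (selector.select(timeoutMs) == 0) {
          // Informative message: include the peer address and the timeout used.
          throw new IOException("write timed out after " + timeoutMs + " ms to "
              + channel.socket().getRemoteSocketAddress());
        }
        selector.selectedKeys().clear();
        channel.write(buf);   // writes whatever the send buffer will accept
      }
    } finally {
      selector.close();
    }
  }
}
{code}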
[jira] Commented: (HADOOP-2066) filenames with ':' colon throws java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/HADOOP-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558701#action_12558701 ] Doug Cutting commented on HADOOP-2066: -- > I'm not at all sure it makes sense to define what is a valid filename based > on a URI library. The URI standard provides a good interchange syntax for file names. But we shouldn't let it limit what names are possible in various filesystems as we do today: we should support their full range, using escapes where necessary. Unfortunately, with the current API, we can't tell when a character needs to be escaped or when it is intended as a URI meta-character. The problem is that we construct paths in FileSystem independent code, so we don't know how to escape things. Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately. Thus, when running on Windows, if one passes a string with unescaped backslashes to LocalFileSystem#createPath(), the backslashes would be interpreted as directory separators, while on Linux or HDFS they'd be treated as literals. Unescaped slashes in a Path URI will always be directory separators, since that's the URI standard we're using for interchange. > filenames with ':' colon throws java.lang.IllegalArgumentException > -- > > Key: HADOOP-2066 > URL: https://issues.apache.org/jira/browse/HADOOP-2066 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: lohit vijayarenu > Attachments: 2066_20071022.patch, HADOOP-2066.patch > > > File names containing colon ":" throws java.lang.IllegalArgumentException > while LINUX file system supports it. > $ hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz filenametest > Exception in thread "main" java.lang.IllegalArgumentException: > java.net.URISyntaxException: Relative path in absolute > URI: testfile-2007-09-24-03:00:00.gz > at org.apache.hadoop.fs.Path.initialize(Path.java:140) > at org.apache.hadoop.fs.Path.(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:50) > at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:273) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:117) > at > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776) > at > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757) > at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:116) > at org.apache.hadoop.fs.FsShell.run(FsShell.java:1229) > at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187) > at org.apache.hadoop.fs.FsShell.main(FsShell.java:1342) > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > testfile-2007-09-24-03:00:00.gz > at java.net.URI.checkPath(URI.java:1787) > at java.net.URI.(URI.java:735) > at org.apache.hadoop.fs.Path.initialize(Path.java:137) > ... 10 more > Path(String pathString) when given a filename which contains ':' treats it as > URI and selects anything before ':' as > scheme, which in this case is clearly not a valid scheme. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
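The ambiguity described in the comment above is easy to reproduce with java.net.URI alone; a purely illustrative example (not Hadoop code):
{code}
import java.net.URI;

public class ColonDemo {
  public static void main(String[] args) {
    // To a URI parser, everything before the first ':' looks like a scheme,
    // so a plain file name with an embedded time stamp parses as an opaque URI.
    URI u = URI.create("testfile-2007-09-24-03:00:00.gz");
    System.out.println(u.getScheme());              // testfile-2007-09-24-03
    System.out.println(u.getSchemeSpecificPart());  // 00:00.gz
    // A FileSystem#createPath() style factory, as suggested above, could
    // escape such names per file system instead of rejecting them.
  }
}
{code}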
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558129#action_12558129 ] Doug Cutting commented on HADOOP-2566: -- Globbing is implemented on top of listPaths() which is implemented on top of listStatus(). The primitive globbing API should not throw away that status information. It should keep it so that glob clients which need it do not have to call getStatus() for each file that matches. Currently the cache of FileStatus hides the cost of these getStatus() calls, but that cache will break things once files and their status can change. So we need globStatus() before we can remove the cache. FileInputFormat, for example, uses globPaths() to list files matching the input specification, then it uses getStatus() on each matching path when building splits. This must change to call globStatus() before the cache is removed. Long-term, globPaths() and listPaths() may perhaps still be useful as utility methods implemented in terms of globStatus() and listStatus(), but since most current users of these will be broken performance-wise once the cache is removed, we should deprecate them now to strongly encourage folks to stop using them before that cache is removed, to give fair warning. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
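A sketch of the usage pattern described above once a globStatus() primitive exists; the helper class and method are hypothetical, not the FileInputFormat patch:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One call returns both the matching paths and their FileStatus, so split
// construction needs no per-file getFileStatus() round trip (and no cache).
class GlobStatusExample {
  static void listInputs(FileSystem fs, Path inputPattern) throws IOException {
    FileStatus[] matches = fs.globStatus(inputPattern);
    for (FileStatus status : matches) {
      if (!status.isDir()) {
        // length is available without an extra namenode call per file
        System.out.println(status.getPath() + " " + status.getLen());
      }
    }
  }
}
{code}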
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558071#action_12558071 ] Doug Cutting commented on HADOOP-2566: -- No, we need 'FileStatus[] globStatus(Path pattern)' instead of 'Path[] globPaths(Path pattern)'. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557821#action_12557821 ] Doug Cutting commented on HADOOP-2560: -- > Much simpler to make the late binding decision to bundle them. The algorithm I outlined above could be done incrementally, rather than all up-front: - N is the desired splits/task - build map for the job inputs - when a node asks for a task, pop up to N splits off its list to form a task - if a node has no more splits, pop splits from other nodes - as each split is popped, remove it from other map entries This is essentially the existing algorithm, except that we allocate more than one split per task. In fact, the existing algorithm handles lots of other subtle cases like speculative execution, task failure, etc. So the best way to implement this is probably to use the existing algorithm multiple times per task, etc. Earlier I'd spoke of implementing this up front, when constructing splits. But if it's done this way, then we needn't actually change public APIs or InputFormats. Tasks could simply internally be changed to execute a list of splits rather than a single split. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557782#action_12557782 ] Doug Cutting commented on HADOOP-2510: -- > The JobScheduler should not be part of the MapReduce sub-project. If we can build MapReduce on top of some shared infrastructure, e.g., a JobScheduler, that is independently maintained and used by a larger community than just the mapreduce community, then that might be a good thing. So I'd love to see a proposal that defines a generally useful primitive layer, with examples of multiple, useful systems that can be layered on top of it, including mapreduce. Also, when this is implemented, I would argue that at least one of these other higher-level systems should be implemented too, in addition to mapreduce, to prove the generality of the lower-level system. Things intended to be reusable that are not in fact reused tend not to actually be reusable. Whether this more primitive layer should be a library that we use to build mapreduce daemons, or a service is an interesting question. The latter would better permit a cluster to be shared by mapreduce and non-mapreduce tasks. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. 
> Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. > * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute clust
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557778#action_12557778 ] Doug Cutting commented on HADOOP-2510: -- > the logic required to run multiple MapReduce jobs is different enough from > running a single > MapReduce job that separate daemons would provide a much cleaner > implementation. If it would improve the implementation, then we should better layer the logic. I have no problem with that. But layering the logic within a single address space will yield a more reliable system than distributing it across multiple hosts. It may be less scalable to keep all the logic in a single service, but I have yet to be convinced that the jobtracker is a scalability bottleneck. So, sure, let's clean up the jobtracker with modular decomposition, but I have yet to see how running different modules of the jobtracker on different hosts will improve things. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. > Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. 
State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. > * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute cluster > paradigm) as a way to directly tackle the two main issues illustrated above. > Clearly, we will have to get around the existing problems, especially w.r.t. > scalability and reliability. > The proposal is to re-work Hadoop Map-Reduce to
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557775#action_12557775 ] Doug Cutting commented on HADOOP-2573: -- > The limit could be max(static_limit, number of cores in cluster / # active > jobs) Jinx! > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557767#action_12557767 ] Doug Cutting commented on HADOOP-2573: -- I think a static limit for all jobs would be useful and best to implement first. After some experience with this, we would be better able to address its shortcomings. Possible future extensions might be: - dynamically altering the limit, e.g., limit=max(min.tasks.per.job, numSlots/numJobsOutstanding) -- ramping up the limit slowly, so that a users's sequential jobs don't have all their slots immediately taken when one job completes -- ramping down the limit slowly, so that tasks are given an opportunity to finish normally before they are killed. - incorporating job priority into the limit > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
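A tiny sketch of the dynamic variant mentioned above; the names (min.tasks.per.job, numSlots, numJobsOutstanding) follow the comment and are illustrative only:
{code}
public class TaskLimit {
  // limit = max(min.tasks.per.job, numSlots / numJobsOutstanding)
  static int runningTaskLimit(int numSlots, int numJobsOutstanding, int minTasksPerJob) {
    if (numJobsOutstanding <= 1) {
      return numSlots;   // a lone job may use every slot
    }
    return Math.max(minTasksPerJob, numSlots / numJobsOutstanding);
  }
}
{code}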
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557758#action_12557758 ] Doug Cutting commented on HADOOP-2573: -- Some discussion of this issue may be found at: http://www.nabble.com/question-about-file-glob-in-hadoop-0.15-tt14702242.html#a14741794 > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2574) bugs in mapred tutorial
bugs in mapred tutorial --- Key: HADOOP-2574 URL: https://issues.apache.org/jira/browse/HADOOP-2574 Project: Hadoop Issue Type: Bug Components: documentation Reporter: Doug Cutting Fix For: 0.15.3, 0.16.0 Sam Pullara sends me: {noformat} Phu was going through the WordCount example... lines 52 and 53 should have args[0] and args[1]: http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html The javac and jar command are also wrong, they don't include the directories for the packages, should be: $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d classes WordCount.java $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes . {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557724#action_12557724 ] Doug Cutting commented on HADOOP-2573: -- This addresses issues raised in HADOOP-2510. > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557722#action_12557722 ] Doug Cutting commented on HADOOP-2510: -- I added HADOOP-2573 for the approach I propose above. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. > Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. 
> * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute cluster > paradigm) as a way to directly tackle the two main issues illustrated above. > Clearly, we will have to get around the existing problems, especially w.r.t. > scalability and reliability. > The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, > static cluster. > Here is an overview of how its main components would look like: > * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. > Lets call this the *JobScheduler* henceforth. Clearly (data-locality aware) > Maui/Moab are candidates for being the scheduler, in which case, the > JobScheduler is just a thin wrapper around them. > * TaskTracker: These stay as before, without some minor changes as > illustrated later in the piece. > * JobClient: Fatten up the JobClient my putting a lot more intelligence into > it. Enhance it to talk to the JobTracker to ask for available TaskTrackers > and then contact them to schedule and m
[jira] Created: (HADOOP-2573) limit running tasks per job
limit running tasks per job --- Key: HADOOP-2573 URL: https://issues.apache.org/jira/browse/HADOOP-2573 Project: Hadoop Issue Type: New Feature Components: mapred Reporter: Doug Cutting Fix For: 0.17.0 It should be possible to specify a limit to the number of tasks per job permitted to run simultaneously. If, for example, you have a cluster of 50 nodes, with 100 map task slots and 100 reduce task slots, and the configured limit is 25 simultaneous tasks/job, then four or more jobs will be able to run at a time. This will permit short jobs to pass longer-running jobs. This also avoids some problems we've seen with HOD, where nodes are underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557714#action_12557714 ] Doug Cutting commented on HADOOP-2567: -- > Would it be make sense to use UserGroupInformation to determine the home dir? Yes, someday. Long-term, username's should be filesystem-specific. But we don't yet have an API to get the username for a particular filesystem. Once that's added, it should be returned as a UserGroupInformation and used to determine the home directory, but until then, I think this is not worth adding. Note that this patch does not change how the home directory in HDFS is computed, it only adds a method to expose the home directory already implicit in HDFS. Changing how we compute it should perhaps be the subject of another issue. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2298) ant target without source and docs
[ https://issues.apache.org/jira/browse/HADOOP-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557709#action_12557709 ] Doug Cutting commented on HADOOP-2298: -- > No one has mentioned any specific name for the target and "minimal" tarfile. I think such things are typically called "binary" or "bin" distributions, no? > ant target without source and docs > --- > > Key: HADOOP-2298 > URL: https://issues.apache.org/jira/browse/HADOOP-2298 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Gautam Kowshik > Attachments: 2298.patch.1 > > > Can we have an ant target or a -D option to build the hadoop tar without the > source and documentation? This brings down the tar size from 11.5 MB to 5.6 > MB. This would speed up distribution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557481#action_12557481 ] Doug Cutting commented on HADOOP-2560: -- > The current system takes full advantage of mapping jobs to nodes dynamically. Currently we compute and cache the mapping once per job, and then base all subsequent decisions on that cache. We get ~99% job locality with that 'static' information. Things should be about about the same if we group things, unless I'm missing something. > One could perhaps do something like what you suggest dynamically in the JT > when a TT requests a new job. That's a possible enhancement. I'm not sure it's required for good localization, and it would add significant load to the namenode. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557478#action_12557478 ] Doug Cutting commented on HADOOP-2385: -- > This means that the configuration classes should be public then, right? Yes, if the parameters they access should be publicly accessible. One might argue that certain parameters are only consumed internally and don't need public accessors, but, more typically, parameter accessors are on public classes. > And it doesn't matter where the get/setters are. > Particularly we can combine all of them in one class > or even place them in the Configuration class. Is it what you want? They shouldn't be all in one place or all in Configuration for the same reason that we don't put everything in a single file: we should attempt to keep related things together, to localize changes. So an HDFS-specific parameter accessor should be on an HDFS-specific class. How fine-grained we localize isn't clear. Generally, finer is better: find the most-specific public class that encompasses the use and add the accessor there. So if something's only used in the Datanode, but used in a few different classes there, then it might best be on Datanode. > What I meant is that we keep placing logically independent > code inside e.g. FSNamesystem, which makes it bigger, while it could easily > be made a separate class. > And configuration is just an example of such logically independent part. If configuration stuff is not specific to FSNamesystem (i.e., logically independent) then it shouldn't go there. If it is specific to FSNamesystem then it could go there, or perhaps on a new class that's used only by FSNamesystem, e.g., FSNamesystemParams. If it's used equally by FSNamesystem and other classes then it could either go on an existing shared class (e.g., Namenode) or a new shared class (NamenodeParams). > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
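For a site that wants the old behaviour back, the opt-out discussed here is a single setting; a minimal illustration using the key named in this issue (normally it would be set in hadoop-site.xml rather than programmatically):
{code}
import org.apache.hadoop.conf.Configuration;

public class DisablePermissionChecking {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Optional, non-default setting: turn enforcement back off.
    // HADOOP-2543 defines the exact semantics of this mode.
    conf.set("dfs.permissions", "false");
    System.out.println("dfs.permissions = " + conf.get("dfs.permissions"));
  }
}
{code}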
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-2.patch Fix another place that assumed working directory wasn't fully qualified. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2404) HADOOP-2185 breaks compatibility with hadoop-0.15.0
[ https://issues.apache.org/jira/browse/HADOOP-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557441#action_12557441 ] Doug Cutting commented on HADOOP-2404: -- > "some processing" of exactly these parameters was introduced in HADOOP-1085. > I opposed it then. You just committed it. That looks like I made a mistake. Mea culpa. I don't recall the details, but in those days I was doing a lot of commits and my reviews may have sufferered. > But I do not agree that they should be introdueced in this patch, which will > lead to massive changes I disagree that the changes are massive. They're easy to locate (points where the modified parameters are accessed) not that many locations, and only affect a line or two of code at each location. I also disagree that the size alone of the change should be a significant factor here. The change is simple enough that it will not be destabilizing. The places changed are not likely to be touched by many other pending patches, so it should not create many conflicts. > This argument is going on for almost a month now. I do not find it productive. > I mean, people can have different opinions, what do you do with that. If committers cannot reach consensus, then the issue can be taken to the PMC, although that seems like overkill in this case. If you decline to fix it in a way that others approve, and it is a blocker, then someone else must develop a patch that we can all agree on before we can make the release. I think you are the best qualified person to fix this. I could try to generate a patch, but it would probably take me a lot longer than it would you and I would be more likely to make subtle errors, since I am less intimate with the changes. > HADOOP-2185 breaks compatibility with hadoop-0.15.0 > --- > > Key: HADOOP-2404 > URL: https://issues.apache.org/jira/browse/HADOOP-2404 > Project: Hadoop > Issue Type: Bug > Components: conf >Affects Versions: 0.16.0 >Reporter: Arun C Murthy >Assignee: Konstantin Shvachko >Priority: Blocker > Fix For: 0.16.0 > > Attachments: ConfigConvert.patch, ConfigConvert2.patch, > ConfigurationConverter.patch > > > HADOOP-2185 removed the following configuration parameters: > {noformat} > dfs.secondary.info.port > dfs.datanode.port > dfs.info.port > mapred.job.tracker.info.port > tasktracker.http.port > {noformat} > and changed the following configuration parameters: > {noformat} > dfs.secondary.info.bindAddress > dfs.datanode.bindAddress > dfs.info.bindAddress > mapred.job.tracker.info.bindAddress > mapred.task.tracker.report.bindAddress > tasktracker.http.bindAddress > {noformat} > without a backward-compatibility story. > Lots are applications/cluster-configurations are prone to fail hence, we need > a way to keep things working as-is for 0.16.0 and remove them for 0.17.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557431#action_12557431 ] Doug Cutting commented on HADOOP-2385: -- > Why setters need to be static? Users need to, e.g., be able to set HDFS parameters on a JobConf. We can get away with a single subclass of Configuration that has setters, but once we add a second, it would be impossible to create a single configuration instance that can configure multiple components. > Why per-package, not per-component? That's fine too. You seemed to be complaining that classes were too specific for this case, so I said I was okay with per-package if you thought that more appropirate here, although perhaps that's too general for your taste in this case, and you'd rather separate, e.g., Namenode from Datanode parameters. That's fine with me too. However I don't find the argument that FSNamesystem is already too big compelling. That's a separate issue: it should perhaps be decomposed into multiple classes, and when that's done, configuration accessors might move around, but if there are FSNamesystem-specific configuration accessors then I'd argue they belong in FSNamesystem, regardless of that class's current size. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
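A hedged illustration of the accessor style being discussed; the class name and configuration key below are hypothetical, chosen only to show the shape (static, so HDFS parameters can be set on any Configuration, including a JobConf):
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical per-component parameter accessors; not existing Hadoop code.
public class NamenodeParams {
  public static final String HANDLER_COUNT_KEY = "dfs.namenode.handler.count"; // illustrative key

  public static void setHandlerCount(Configuration conf, int count) {
    conf.setInt(HANDLER_COUNT_KEY, count);   // works on a JobConf too, since it extends Configuration
  }

  public static int getHandlerCount(Configuration conf) {
    return conf.getInt(HANDLER_COUNT_KEY, 10);
  }
}
{code}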
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557428#action_12557428 ] Doug Cutting commented on HADOOP-2528: -- > In this particular jira, is it OK that we create the output directory by the > job client? +1 That would make this patch very simple, not much more than one line! However, we should not lose some of the changes to FileSystem.java, those deprecating all of the listPaths() signatures, and adding a listStatus(Path, Filter) signature. Should we add a separate issue for those, or fix them as a part of HADOOP-2566? > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-1.patch Fix a test case that assumed getWorkingDir() was not fully qualified. Note that because of this change (working dirs are now fully qualified) this change should probably be included in the "incompatible" section. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567-1.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557412#action_12557412 ] Doug Cutting commented on HADOOP-2560: -- > It is not going to work to combine splits statically because block replicas > are not co-resident. If the number of blocks in the job input is hugely greater than the number of nodes, then it should be easy to find nodes that have a large number of blocks locally, and group the blocks thusly into tasks. If a task fails, then the re-execution might not be local, but most tasks don't fail, and we can arrange things so that the first node a task is assigned to has all its blocks. Or am i missing something? Consider the following algorithm: - build and maps for the job input files - N is the desired blocks/task - for (node : nodes) pop N blocks off each nodes list and add it to the list of tasks - as each block is popped, also remove it from all other node's lists, using the other map to accelerate this - repeat until nodes have fewer than N blocks, then emit tasks with fewer than N blocks as the tail of the job Wouldn't that work? > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
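A rough sketch of the grouping outlined above. The "maps" it builds amount to a node-to-blocks map plus a reverse block-to-nodes index (the "other map" used to accelerate removal); Block, the node keys, and the class itself are stand-ins, not Hadoop classes:
{code}
import java.util.*;

// Hypothetical greedy grouping: pop up to n local blocks per node into a task,
// removing each popped block from every other node's list via the reverse index.
class BlockGrouper {
  static <B> List<List<B>> group(Map<String, List<B>> blocksPerNode, int n) {
    Map<B, Set<String>> nodesPerBlock = new HashMap<B, Set<String>>();
    Map<String, LinkedHashSet<B>> remaining = new HashMap<String, LinkedHashSet<B>>();
    for (Map.Entry<String, List<B>> e : blocksPerNode.entrySet()) {
      remaining.put(e.getKey(), new LinkedHashSet<B>(e.getValue()));
      for (B b : e.getValue()) {
        Set<String> nodes = nodesPerBlock.get(b);
        if (nodes == null) {
          nodes = new HashSet<String>();
          nodesPerBlock.put(b, nodes);
        }
        nodes.add(e.getKey());
      }
    }
    List<List<B>> tasks = new ArrayList<List<B>>();
    boolean progress = true;
    while (progress) {                 // sweep until no node has blocks left
      progress = false;
      for (String node : remaining.keySet()) {
        LinkedHashSet<B> local = remaining.get(node);
        if (local.isEmpty()) {
          continue;
        }
        List<B> task = new ArrayList<B>();
        Iterator<B> it = local.iterator();
        while (it.hasNext() && task.size() < n) {
          B block = it.next();
          it.remove();
          task.add(block);
          for (String other : nodesPerBlock.get(block)) {
            if (!other.equals(node)) {
              remaining.get(other).remove(block);   // no longer available elsewhere
            }
          }
        }
        tasks.add(task);               // tail tasks may hold fewer than n blocks
        progress = true;
      }
    }
    return tasks;
  }
}
{code}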
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Open) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2567: Assignee: Doug Cutting > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567.patch Patch that implements this. Also makes both home and working dirs fully qualified. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2567) add FileSystem#getHomeDirectory() method
add FileSystem#getHomeDirectory() method Key: HADOOP-2567 URL: https://issues.apache.org/jira/browse/HADOOP-2567 Project: Hadoop Issue Type: New Feature Components: fs Reporter: Doug Cutting Fix For: 0.16.0 The FileSystem API would benefit from a getHomeDirectory() method. The default implementation would return "/user/$USER/". RawLocalFileSystem would return System.getProperty("user.home"). HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
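A minimal sketch of the two behaviours described in the issue above, written outside any FileSystem subclass. It assumes "$USER" maps to the user.name system property and omits the fully-qualifying step discussed in the patch comments; the real patch may differ.
{code}
import org.apache.hadoop.fs.Path;

public class HomeDirSketch {
  // What the generic FileSystem default could return: "/user/$USER/".
  static Path defaultHome() {
    return new Path("/user/" + System.getProperty("user.name"));
  }

  // What RawLocalFileSystem could return instead: the local home directory.
  static Path localHome() {
    return new Path(System.getProperty("user.home"));
  }

  public static void main(String[] args) {
    System.out.println(defaultHome() + " vs " + localHome());
  }
}
{code}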
[jira] Commented: (HADOOP-2268) JobControl classes should use interfaces rather than implemenations
[ https://issues.apache.org/jira/browse/HADOOP-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557403#action_12557403 ] Doug Cutting commented on HADOOP-2268: -- +1 This patch looks fine to me. > JobControl classes should use interfaces rather than implemenations > --- > > Key: HADOOP-2268 > URL: https://issues.apache.org/jira/browse/HADOOP-2268 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Affects Versions: 0.15.0 >Reporter: Adrian Woodhead >Assignee: Adrian Woodhead >Priority: Minor > Fix For: 0.16.0 > > Attachments: HADOOP-2268-1.patch, HADOOP-2268-2.patch, > HADOOP-2268-3.patch, HADOOP-2268-4.patch > > > See HADOOP-2202 for background on this issue. Arun C. Murthy agrees that when > possible it is preferable to program against the interface rather than a > concrete implementation (more flexible, allows for changes of the > implementation in future etc.) JobControl currently exposes running, waiting, > ready, successful and dependent jobs as ArrayList rather than List. I propose > to change this to List. > I will code up a patch for this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
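The change itself is small; a generic sketch of the style being approved here, with hypothetical names:
{code}
import java.util.ArrayList;
import java.util.List;

// The backing field can stay an ArrayList, but the accessor exposes the List interface,
// so the concrete collection can change later without breaking callers.
class JobQueueSketch<J> {
  private final List<J> running = new ArrayList<J>();

  public List<J> getRunningJobs() {   // previously: public ArrayList<J> getRunningJobs()
    return running;
  }

  public void addJob(J job) {
    running.add(job);
  }
}
{code}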
[jira] Assigned: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2566: Assignee: Hairong Kuang > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2514: Assignee: Doug Cutting > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2564) NameNode to blat total number of files and blocks
[ https://issues.apache.org/jira/browse/HADOOP-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557379#action_12557379 ] Doug Cutting commented on HADOOP-2564: -- This was included in HADOOP-2447, just committed, no? If that's satisfactory, we can close this as "duplicate". > NameNode to blat total number of files and blocks > - > > Key: HADOOP-2564 > URL: https://issues.apache.org/jira/browse/HADOOP-2564 > Project: Hadoop > Issue Type: Improvement >Reporter: Marco Nicosia >Priority: Minor > Fix For: 0.17.0 > > > Right now, the namenode reports lots of rates (block read per sec, removed > per sec, etc etc) but it doesn't actually report how many files and blocks > total exist in the system. It'd be great if we could have this, so that our > reporting systems can show the growth trends over time. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2566) need FileSystem#globStatus method
need FileSystem#globStatus method - Key: HADOOP-2566 URL: https://issues.apache.org/jira/browse/HADOOP-2566 Project: Hadoop Issue Type: Improvement Components: fs Reporter: Doug Cutting Fix For: 0.16.0 To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[]. Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus(). We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
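A sketch of how the proposed method might be used; the signature below is a guess at what HADOOP-2566 could add, not an API that existed at the time.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobStatusSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Same glob semantics as globPaths(), but each match carries its FileStatus, so callers
    // get length, modification time, permissions, etc. without an extra RPC per matched path.
    FileStatus[] matches = fs.globStatus(new Path(args[0]));   // hypothetical method
    for (FileStatus s : matches) {
      System.out.println(s.getPath() + "\t" + s.getLen());
    }
  }
}
{code}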
[jira] Created: (HADOOP-2565) DFSPath cache of FileStatus can become stale
DFSPath cache of FileStatus can become stale Key: HADOOP-2565 URL: https://issues.apache.org/jira/browse/HADOOP-2565 Project: Hadoop Issue Type: Bug Affects Versions: 0.16.0 Reporter: Doug Cutting Fix For: 0.17.0 Paths returned from DFS internally cache their FileStatus, so that getStatus(Path) does not require another RPC. This cache is never refreshed and become stale, resulting in program error. This should not be fixed until FileSystem#listStatus() is removed by HADOOP-2563, and user code is thus no longer dependent on this cache for good performance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2563) Remove deprecated FileSystem#listPaths()
Remove deprecated FileSystem#listPaths() Key: HADOOP-2563 URL: https://issues.apache.org/jira/browse/HADOOP-2563 Project: Hadoop Issue Type: Improvement Components: fs Reporter: Doug Cutting Fix For: 0.17.0 FileSystem#listPaths() has been deprecated for a few releases, and we should now remove it, upgrading everything to use FileSystem#listStatus(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2404) HADOOP-2185 breaks compatibility with hadoop-0.15.0
[ https://issues.apache.org/jira/browse/HADOOP-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557369#action_12557369 ] Doug Cutting commented on HADOOP-2404: -- > I thought and still think it is more fair not to provide any backward > compatibility at all [ ... ] Huh? That's a change from what you stated in [#action_12550831]. No one is asking for 100% back-compatibility here, but rather for a reasonable interpretation where possible of configuration parameters that have changed. At the very least, if we can easily detect that someone is using a feature that has been incompatibly changed, we should attempt to emit a warning, and not just let things mysteriously fail, no? > I understand your irritation on the configuration issues, but I don't > understand why blame my or equally any other patch for not dealing with them. You imply that I am asking this issue to fix a few instances of a widespread problem unrelated to the issue. That is not the case. The issue is both specific and related. If a config parameter is only read in a single place, then no accessor method is needed. If it is simply read in multiple places, then an accessor method is nice, since it helps prevent misspellings and makes things easier if the parameter ever requires more processing, but not mandatory. Once some processing is needed for every access to a parameter then an accessor method is required, since otherwise we'd replicate non-trivial program logic. HADOOP-2185 pushed several parameters past this threshold, since back-compatibility processing is now required when these parameters are accessed, and thus accessor methods must be added. > HADOOP-2185 breaks compatibility with hadoop-0.15.0 > --- > > Key: HADOOP-2404 > URL: https://issues.apache.org/jira/browse/HADOOP-2404 > Project: Hadoop > Issue Type: Bug > Components: conf >Affects Versions: 0.16.0 >Reporter: Arun C Murthy >Assignee: Konstantin Shvachko >Priority: Blocker > Fix For: 0.16.0 > > Attachments: ConfigConvert.patch, ConfigConvert2.patch, > ConfigurationConverter.patch > > > HADOOP-2185 removed the following configuration parameters: > {noformat} > dfs.secondary.info.port > dfs.datanode.port > dfs.info.port > mapred.job.tracker.info.port > tasktracker.http.port > {noformat} > and changed the following configuration parameters: > {noformat} > dfs.secondary.info.bindAddress > dfs.datanode.bindAddress > dfs.info.bindAddress > mapred.job.tracker.info.bindAddress > mapred.task.tracker.report.bindAddress > tasktracker.http.bindAddress > {noformat} > without a backward-compatibility story. > Lots are applications/cluster-configurations are prone to fail hence, we need > a way to keep things working as-is for 0.16.0 and remove them for 0.17.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
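An illustration of the kind of accessor being argued for: once a parameter needs back-compatibility handling on every read, that handling has to live in one method rather than be repeated at each call site. Key names and defaults below are placeholders, not taken from HADOOP-2185 or any attached patch.
{code}
import org.apache.hadoop.conf.Configuration;

public class InfoServerConfig {
  /** Read the new-style combined address, falling back to the old host/port pair with a warning. */
  public static String getInfoServerAddress(Configuration conf) {
    String addr = conf.get("dfs.info.address");                 // hypothetical new-style key
    if (addr != null) {
      return addr;
    }
    String host = conf.get("dfs.info.bindAddress", "0.0.0.0");  // old 0.15-style keys
    String port = conf.get("dfs.info.port", "50070");
    System.err.println("WARNING: dfs.info.bindAddress/dfs.info.port are deprecated;"
        + " use dfs.info.address instead");
    return host + ":" + port;
  }
}
{code}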
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557359#action_12557359 ] Doug Cutting commented on HADOOP-2385: -- > The Configuration itself should remain the same for each component. > It just exposes get methods specific to the component. Yes, that would work for getters, but not for setters. In many cases we need setters too, and it would be confusing to implement getters and setters using different styles. Setters are best implemented as static methods, thus, for symmetry, getters must be also. > I do not support the idea of placing static getters for configuration > parameters in the (top-level) component I'm okay having per-package config classes (e.g., DFSConfig) that centralize configuration setters and getters for that package, since, in some cases, the classes which consume these (e.g., FSNamesystem) are not public classes. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
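A sketch of the per-package style accepted above, using DFSConfig as the example name from the comment; the key shown is a real one, but the class and its methods are illustrative only.
{code}
import org.apache.hadoop.conf.Configuration;

public class DFSConfig {
  private static final String REPLICATION_KEY = "dfs.replication";

  // Static getter and setter live together, so the key spelling, the default,
  // and any validation or back-compat processing are written exactly once.
  public static int getReplication(Configuration conf) {
    return conf.getInt(REPLICATION_KEY, 3);
  }

  public static void setReplication(Configuration conf, int replication) {
    if (replication < 1) {
      throw new IllegalArgumentException("replication must be >= 1: " + replication);
    }
    conf.setInt(REPLICATION_KEY, replication);
  }
}
{code}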
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557340#action_12557340 ] Doug Cutting commented on HADOOP-2560: -- > combine multiple input blocks with the same rack into one split [ ... ] That makes good sense to me. The new Split class could look a lot like MultiFileSplit, but would additionally support a 'getStart(int)' method. So perhaps MultiFileSplit could be extended for this purpose. FileInputFormat could be modified to emit these when the number of splits would otherwise exceed some threshold. But then all subclasses of FileInputFormat would need to be modified to be able to accept these. That wouldn't be too hard. FileInputFormat could implement getRecordReader(InputSplit) to break out the sub-splits, then call a new method, getRecordReader(FileSplit). All existing subclasses could then just change the signature of their getRecordReader implementations in order to support the new feature. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
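A sketch of what such a split could carry, per the comment above: like MultiFileSplit it names several files, but it also records a start offset per file. The names are illustrative, and the Writable and getLocations() plumbing a real InputSplit needs is omitted.
{code}
import org.apache.hadoop.fs.Path;

public class CombinedFileSplitSketch {
  private final Path[] paths;
  private final long[] starts;    // the addition over MultiFileSplit: per-file start offsets
  private final long[] lengths;

  public CombinedFileSplitSketch(Path[] paths, long[] starts, long[] lengths) {
    this.paths = paths;
    this.starts = starts;
    this.lengths = lengths;
  }

  public int getNumPaths()      { return paths.length; }
  public Path getPath(int i)    { return paths[i]; }
  public long getStart(int i)   { return starts[i]; }   // the method the comment asks for
  public long getLength(int i)  { return lengths[i]; }
}
{code}
FileInputFormat#getRecordReader(InputSplit) could then unpack one of these into per-file FileSplits and hand each to the subclass's getRecordReader(FileSplit), as described in the comment.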
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557066#action_12557066 ] Doug Cutting commented on HADOOP-2510: -- The stated goals of this design are to improve things when running mapreduce on a subset of the nodes of a cluster, when HDFS is run on all nodes. The current approach is to run new mapreduce daemons (jobtracker and tasktrackers) for the subset. The problems are that this does not utilize nodes as fully as they could be (e.g., during the tail of a job) and it inhibits data locality optimizations. The proposed solution is to split the jobtracker daemon in two, one shared, long-running daemon, and a per job daemon. My concern with this approach is that adding a new kind of daemon considerably complicates things. New classes of daemons exponentially increase the number of failure modes that must be tested and debugged. This could be warranted if it permitted greater sharing of functionality between systems, reducing the amount of functionality that we must maintain. For example, we could add a general node allocation system, and built map-reduce on top of this. But for that to be a convincingly independent layer, we'd need to demonstrate that we can build other, non-mapreduce systems on it, e.g., perhaps hdfs, but this proposal doesn't seem to offer that. I propose that the stated problems can be more simply and directly solved without adding a new daemon, but with the existing integrated system. We can add a job parameter naming the maximum number of nodes that will be used simultaneously. Then a single jobtracker for the entire cluster can schedule tasks for multiple jobs at a time, each running on different subsets of nodes. A cluster of 1000 nodes might be configured to limit jobs to 200 nodes each. As jobs are winding down and no longer use all 200 nodes, the next job can use those nodes, improving utilization, the first stated goal of this issue. The entire cluster is available to the jobtracker for scheduling, so that it can arrange to place tasks on nodes where their data is local, addressing the second stated goal of this issue. Splitting the jobtracker sounds like it would simplify things, since it would result in two simpler services, but distributed systems are more impacted by the number of kinds of services than by the complexity of a single service. Thus perhaps the jobtracker could be better structured internally, to separate concerns within its implementation, but I do not yet see an argument for moving them to separate services. That seems like it will only make things less reliable: the same logic running in two daemons that could run equivalently in a single daemon. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. 
Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduc
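Purely as an illustration of the alternative proposed in the comment above (a per-job node cap honored by a single, cluster-wide jobtracker) rather than anything that existed at the time; the parameter name below is made up.
{code}
import org.apache.hadoop.mapred.JobConf;

public class NodeCapExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Hypothetical knob: never schedule this job's tasks on more than 200 nodes at once.
    job.setInt("mapred.job.max.nodes", 200);
    // Scheduler-side pseudo-rule: before assigning a task of this job to a node it is not
    // already running on, check that the job is running on fewer than 200 distinct nodes.
  }
}
{code}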
[jira] Commented: (HADOOP-1824) want InputFormat for zip files
[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557061#action_12557061 ] Doug Cutting commented on HADOOP-1824: -- > 2. Override the getSplits() method to read each file's InputStream I think getSplits() should construct a split for each element of java.util.zip.ZipFile#entries(). > 3. Create FileSplits [ ... ] We should probably extend FileSplit or InputSplit specifically for zip files. The fields needed per split are the archive file's path and the path of the file within the archive. I don't think there's much point in supporting splits smaller than a file within the zip archive, so start and end offsets are not required here. > 4. Implement class ZipRecordReader to read each zip entry in its split Using LineRecordReader. We should be able to use LineRecordReader directly, passing its constructor the result of ZipFile#getInputStream(). > want InputFormat for zip files > -- > > Key: HADOOP-1824 > URL: https://issues.apache.org/jira/browse/HADOOP-1824 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > > HDFS is inefficient with large numbers of small files. Thus one might pack > many small files into large, compressed, archives. But, for efficient > map-reduce operation, it is desireable to be able to split inputs into > smaller chunks, with one or more small original file per split. The zip > format, unlike tar, permits enumeration of files in the archive without > scanning the entire archive. Thus a zip InputFormat could efficiently permit > splitting large archives into splits that contain one or more archived files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
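The two halves described above can be sketched with plain java.util.zip, independent of the eventual Hadoop classes; the split and reader types themselves are only described in the comments.
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipSplitSketch {
  public static void main(String[] args) throws Exception {
    ZipFile zip = new ZipFile(args[0]);

    // "getSplits": one split per archived file; the archive path plus the entry name is all a
    // split needs, since there is little point splitting inside an individual entry.
    for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
      ZipEntry entry = e.nextElement();
      if (!entry.isDirectory()) {
        System.out.println("split: " + args[0] + "!" + entry.getName());
      }
    }

    // "getRecordReader": ZipFile#getInputStream(entry) is an ordinary InputStream, so a
    // LineRecordReader (or, here, a BufferedReader) can consume it directly.
    ZipEntry first = zip.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zip.getInputStream(first)));
    for (String line; (line = in.readLine()) != null;) {
      System.out.println(line);
    }
    in.close();
    zip.close();
  }
}
{code}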
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Attachment: HADOOP-2552.patch > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Assignee: Doug Cutting Status: Patch Available (was: Open) > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2552) enable hdfs permission checking by default
enable hdfs permission checking by default -- Key: HADOOP-2552 URL: https://issues.apache.org/jira/browse/HADOOP-2552 Project: Hadoop Issue Type: Improvement Components: dfs Reporter: Doug Cutting Fix For: 0.16.0 We should enable permission checking in dfs by default. Currently, on upgrade, all file permissions are 777, so this is a back-compatible change. After an upgrade folks can change owners and groups and limit permissions, and things will work as expected. The current default, dfs.permissions=false, gives inconsistent behaviour: permissions are displayed in 'ls' and returned by the FileSystem APIs, but they're not enforced. In future releases we will certainly want dfs.permissions=true to be the default, and making it so now will thus also avoid an incompatible change. dfs.permissions=false should be an optional, non-default configuration that some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2532) Add to MapFile a getClosest that returns key that comes just-before if key not present (Currently does just-after only).
[ https://issues.apache.org/jira/browse/HADOOP-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557008#action_12557008 ] Doug Cutting commented on HADOOP-2532: -- +1 This looks fine to me. > Add to MapFile a getClosest that returns key that comes just-before if key > not present (Currently does just-after only). > > > Key: HADOOP-2532 > URL: https://issues.apache.org/jira/browse/HADOOP-2532 > Project: Hadoop > Issue Type: New Feature >Reporter: stack >Assignee: stack >Priority: Minor > Fix For: 0.16.0 > > Attachments: getclosestbefore-v2.patch, getclosestbefore-v3.patch, > getclosestbefore.patch > > > The list of regions that make up a table in hbase are effectively kept in a > mapfile. Regions are identified by the first row contained by that region. > To find the region that contains a particular row, we need to be able to > search the mapfile of regions to find the closest matching row that falls > just-before the searched-for key rather than the just-after that is current > mapfile getClosest behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
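A sketch of how the new behaviour would be used, assuming the patch adds a boolean "before" argument to MapFile.Reader#getClosest (the pre-patch method only searches forward); the key type and file layout here are just for illustration.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ClosestBeforeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader regions = new MapFile.Reader(fs, args[0], conf);

    Text row = new Text("row-0042");
    Text regionInfo = new Text();
    // Find the region whose start row is the largest key <= the searched-for row.
    Text startRow = (Text) regions.getClosest(row, regionInfo, true /* before */);
    System.out.println("row " + row + " falls in the region starting at " + startRow);
    regions.close();
  }
}
{code}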
[jira] Commented: (HADOOP-2206) Design/implement a general log-aggregation framework for Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556991#action_12556991 ] Doug Cutting commented on HADOOP-2206: -- > I got Arun a copy of Scribe a few months ago. Any chance you can post a public copy somewhere? > Design/implement a general log-aggregation framework for Hadoop > --- > > Key: HADOOP-2206 > URL: https://issues.apache.org/jira/browse/HADOOP-2206 > Project: Hadoop > Issue Type: New Feature > Components: dfs, mapred >Reporter: Arun C Murthy >Assignee: Arun C Murthy > Fix For: 0.17.0 > > > I'd like to propose a log-aggregation framework which facilitates collection, > aggregation and storage of the logs of the Hadoop Map-Reduce framework and > user-jobs in HDFS. Clearly the design/implementation of this framework is > heavily influenced and limited by Hadoop itself for e.g. lack of appends, not > too many small files (think: stdout/stderr/syslog of each map/reduce task) > and so on. > This framework will be especially useful once HoD (HADOOP-1301) is used to > provision dynamic, per-user, Map-Reduce clusters. > h4. Requirements: > * Store the various logs to a configurable location in the Hadoop > Distributed FileSystem > ** User task logs (stdout, stderr, syslog) > ** Map-Reduce daemons' logs (JobTracker and TaskTracker) > * Integrate well with Hadoop and ensure no adverse performance impact on the > Map-Reduce framework. > * It must not use a HDFS file (or more!) per a task, which would swamp the > NameNode capabilities. > * The aggregation system must be distributed and reliable. > * Facilities/tools to read the aggregated logs. > * The aggregated logs should be compressed. > h4. Architecture: > Here is a high-level overview of the log-aggregation framework: > h5. Logging > * Provision a cloud of log-aggregators in the cluster (outside of the Hadoop > cluster, running on the subset of nodes in the cluster). Lets call each one > in the cloud as a Log Aggregator i.e. LA. > * Each LA writes out 2 files per Map-Reduce cluster: an index file and a data > file. The LA maintains one directory per Map-Reduce cluster on HDFS. > * The index file format is simple: > ** streamid (_streamid_ is either daemon identifier e.g. > tasktracker_foo.bar.com:57891 or $jobid-$taskid-(stdout|stderr|syslog) or > individual task-logs) > ** timestamp > ** logs-data start offset > ** no. of bytes > * Each Hadoop daemon (JT/TT) is given the entire list of LAs in the cluster. > * Each daemon picks one LA (at random) from the list, opens an exclusive > stream with the LA after identifying itself (i.e. ${daemonid}) and sends it's > logs. In case of error/failure to log it just connects to another LA as above > and starts logging to it. > * The logs are sent to the LA by a new log4j appender. The appender provides > some amount of buffering on the client-side. > * Implement a feature in the TaskTracker which lets it use the same appender > to send out the userlogs (stdout/stderr/syslog) to the LA after task > completion. This is important to ensure that logging to the LA at runtime > doesn't hurt the task's performance (see HADOOP-1553). The TaskTracker picks > an LA per task in a manner similar to the one it uses for it's own logs, > identifies itself (<${jobid}, ${taskid}, {stdout|stderr|syslog}>) and streams > the entire task-log at one go. In fact we can pick different LAs for each of > the task's stdout, stderr and syslog logs - each an exclusive stream to a > single LA. 
> * The LA buffers some amount of data in memory (say 16K) and then flushes > that data to the HDFS file (per LA per cluster) after writing out an entry to > the index file. > * The LA periodically purges old logs (monthly, fortnightly or weekly as > today). > h5. Getting the logged information > The main requirement is to implement a simple set of tools to query the LA > (i.e. the index/data files on HDFS) to glean the logged information. > If we can think of each Map-Reduce cluster's logs as a set of archives (i.e. > one file per cluster per LA used) we need the ability to query the > log-archive to figure out the available streams and the ability to get one > entire stream or a subset of time based on timestamp-ranges. Essentially > these are simple tools which parse the index files of each LA (for a given > Hadoop cluster) and return the required information. > h6. Query for available streams > The query just returns all the available streams in an cluster-log archive > identified by the HDFS path. > It looks something like this for a cluster with 3 nodes which ran 2 jobs, > first of which had 2 maps, 1 reduce and the second had 1 map, 1 reduce: > {noformat} >$ la -query /log-aggregation/cluster-20071113 >Ava
[jira] Commented: (HADOOP-1873) User permissions for Map/Reduce
[ https://issues.apache.org/jira/browse/HADOOP-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556989#action_12556989 ] Doug Cutting commented on HADOOP-1873: -- +1 this looks good to me. Thanks for your patience in working this out! > User permissions for Map/Reduce > --- > > Key: HADOOP-1873 > URL: https://issues.apache.org/jira/browse/HADOOP-1873 > Project: Hadoop > Issue Type: Improvement >Reporter: Raghu Angadi >Assignee: Hairong Kuang > Attachments: mapred.patch, mapred2.patch, mapred3.patch, > mapred4.patch, mapred5.patch, mapred6.patch, mapred7.patch > > > HADOOP-1298 and HADOOP-1701 add permissions and pluggable security for DFS > files and DFS accesses. Same users permission should work for Map/Reduce jobs > as well. > User persmission should propegate from client to map/reduce tasks and all the > file operations should be subject to user permissions. This is transparent to > the user (i.e. no changes to user code should be required). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2551) hadoop-env.sh needs finer granularity
[ https://issues.apache.org/jira/browse/HADOOP-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556986#action_12556986 ] Doug Cutting commented on HADOOP-2551: -- I don't think we need HADOOP_GLOBAL_OPTS, we can just use HADOOP_OPTS for that, but we could add a HADOOP_NAMENODE_OPTS that, when starting the namenode, is appended to HADOOP_OPTS, etc. In general, we could modify bin/hadoop to add the value of HADOOP_{$COMMAND}_OPTS to HADOOP_OPTS. Would that suffice? > hadoop-env.sh needs finer granularity > - > > Key: HADOOP-2551 > URL: https://issues.apache.org/jira/browse/HADOOP-2551 > Project: Hadoop > Issue Type: Improvement >Reporter: Allen Wittenauer >Priority: Minor > > We often configure our HADOOP_OPTS on the name node to have JMX running so > that we can do JVM monitoring. But doing so means that we need to edit this > file if we want to run other hadoop commands, such as fsck. It would be > useful if hadoop-env.sh was refactored a bit so that there were different > and/or cascading HADOOP_OPTS dependent upon which process/task was being > performed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
[ https://issues.apache.org/jira/browse/HADOOP-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556983#action_12556983 ] Doug Cutting commented on HADOOP-2531: -- Okay, I found it: the default permissions on upgrade are 777, with both user and group set to HadoopAnonymous. So I'm now leaning towards switching to dfs.permissions=true by default. > HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false > > > Key: HADOOP-2531 > URL: https://issues.apache.org/jira/browse/HADOOP-2531 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > > Generic permission checking code should still work correctly when > dfs.permissions=false. Currently FileStatus#getPermission() returns the > actual permission when dfs.permissions=false on the namenode, which is > incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
[ https://issues.apache.org/jira/browse/HADOOP-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556981#action_12556981 ] Doug Cutting commented on HADOOP-2531: -- The use case of dfs.permissions=false was better explained to me yesterday. It is intended to permit admins to set permissions after upgrade while leaving the filesystem available for use. If this use case is really important, then we should mark this "won't fix". Nigel expressed concerns about displaying permissions in "ls" that are not enforced would be confusing to users, that returning 777 would be better for that reason too. But if dfs.permissions is only meant to be used during transition, this may not be a serious issue. I'm beginning to think that dfs.permissions should be 'true' by default, and that the default permission on upgrade should be 777. That is back-compatible. Then, if folks like, they can set more prohibitive permissions and/or disable permission checking. If this is the default behavior then I am okay marking this issue "won't fix". Currently dfs.permissions is 'false' by default, so perhaps that should change. I am not yet certain what the default file permission is after an upgrade... > HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false > > > Key: HADOOP-2531 > URL: https://issues.apache.org/jira/browse/HADOOP-2531 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > > Generic permission checking code should still work correctly when > dfs.permissions=false. Currently FileStatus#getPermission() returns the > actual permission when dfs.permissions=false on the namenode, which is > incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556976#action_12556976 ] Doug Cutting commented on HADOOP-2528: -- Raghu and Hairong yesterday raised a few relevant issues: - the superuser name is not currently known on the client, and until it is, we can get false negatives in permission checks - the goal of dfs.permissions=false is for admins to be able to set, examine and alter permissions before they are enforced, so that a filesystem may be upgraded and returned to service before permissions are completely configured. Returning 777 for all files when dfs.permissions=false would prohibit this use. This patch, as it stands, fights a bit with that use case too. If permission checking is disabled on the namenode, then there's a good chance that permissions are not yet correctly configured there, so checking them clientside may give the wrong results. Thus the goal of permitting folks to run jobs while permissions are being configured may be defeated by this patch. This patch was meant to be provocative: we're providing new APIs, but we have little real code that uses these new APIs. Mapreduce input/output validation seems like an obvious place to add permission checks, and hence an opportunity to check the usability of the APIs. I'm currently on the fence as to whether this patch should be committed in 0.16. Once dfs.permisisons=true, it would be really nice to fail a job quickly if its output directory is not writable, without first running all of the maps. Readability of input is less critical, since that will fail fairly quickly anyway. Perhaps we should add a utility method that checks the writability of a directory by creating and removing an empty file. This would be more reliably correct. I'll create a new patch with this approach. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
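A sketch of the probe utility suggested in the last paragraph; the method name is made up. Rather than re-implementing the permission model on the client, it asks the filesystem to create and delete an empty scratch file and lets the namenode be the judge.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputDirCheck {
  public static void checkWritable(FileSystem fs, Path dir) throws IOException {
    Path probe = new Path(dir, ".probe-" + System.currentTimeMillis());
    try {
      fs.create(probe).close();   // fails here if the submitter may not write to dir
    } catch (IOException e) {
      throw new IOException("Output directory " + dir + " is not writable: " + e.getMessage());
    }
    fs.delete(probe, false);      // clean up the empty probe file
  }
}
{code}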
[jira] Commented: (HADOOP-1873) User permissions for Map/Reduce
[ https://issues.apache.org/jira/browse/HADOOP-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556969#action_12556969 ] Doug Cutting commented on HADOOP-1873: -- Another option: In the FileSystem.create and FileSystem.mkdirs static utility methods, we might create the file or directory first, then set the protection. This has the disadvantage of making two RPC calls, but it has the advantage of being thread safe. In the current case (job submission) the performance impact of these extra RPCs would be negligible, no? > User permissions for Map/Reduce > --- > > Key: HADOOP-1873 > URL: https://issues.apache.org/jira/browse/HADOOP-1873 > Project: Hadoop > Issue Type: Improvement >Reporter: Raghu Angadi >Assignee: Hairong Kuang > Attachments: mapred.patch, mapred2.patch, mapred3.patch, > mapred4.patch, mapred5.patch, mapred6.patch > > > HADOOP-1298 and HADOOP-1701 add permissions and pluggable security for DFS > files and DFS accesses. Same users permission should work for Map/Reduce jobs > as well. > User persmission should propegate from client to map/reduce tasks and all the > file operations should be subject to user permissions. This is transparent to > the user (i.e. no changes to user code should be required). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
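A sketch of the create-then-set-permission option described above, for the mkdirs case; the static helper name is illustrative. It costs two RPCs, but no shared client-side state is involved, which is presumably what makes it thread safe.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class MkdirsWithPermission {
  public static boolean mkdirs(FileSystem fs, Path dir, FsPermission perm) throws IOException {
    boolean created = fs.mkdirs(dir);   // first RPC: create with the filesystem's default mode
    fs.setPermission(dir, perm);        // second RPC: then set the protection explicitly
    return created;
  }
}
{code}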
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556947#action_12556947 ] Doug Cutting commented on HADOOP-2514: -- I'm +1 for Sanjay's option 2 for 0.16. Note I don't believe this issue should be a blocker, since the existing trash code will work with a globally writable /trash. So we need to implement option 2 before the freeze. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556146#action_12556146 ] Doug Cutting commented on HADOOP-2528: -- > Should we enable permissions by default in DFS, at least through development > phase [ ...] I think we should certainly encourage developers to do this, but I'm hesitant to change it in subversion. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2528: - Attachment: HADOOP-2528-1.patch Here's an updated version of the patch. I previously assumed that one could, e.g., read a file you own with rwx permissions, but in fact you can't. If you're the owner, then only the owner permissions are examined. I've updated the generic checker here to reflect that. I learn something new every day! > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
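The rule described above (only the matching class of bits is consulted, starting with the owner) can be sketched as a client-side check over a FileStatus; this is an approximation for illustration, not HDFS's own enforcement path.
{code}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class AccessCheck {
  public static boolean canAccess(FileStatus stat, String user, String[] groups, FsAction want) {
    FsPermission perm = stat.getPermission();
    if (user.equals(stat.getOwner())) {
      return perm.getUserAction().implies(want);    // owner: only the owner bits count
    }
    for (String group : groups) {
      if (group.equals(stat.getGroup())) {
        return perm.getGroupAction().implies(want); // group member: only the group bits count
      }
    }
    return perm.getOtherAction().implies(want);     // everyone else: the "other" bits
  }
}
{code}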
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556138#action_12556138 ] Doug Cutting commented on HADOOP-2514: -- > Trashing will be more efficent I think it is premature to optimize this, especially if that involves complicating the namenode kernel. > We are able to treat delete as delete not rename and therefore perform the > right permission checking. I'm confused by this. Moving something to the trash is not deleting it, it's moving it. Don't we want folks to be able to move things out of the trash again? So the trash needs to be a directory where the user can write things, and that permission must be validated on move-to-trash. We might also check some other things, like whether the user has the right to delete those files, but that's just to keep folks from being surprised later if their trash isn't actually deleted. Someone could still chmod something in the trash and get into the same situation. To truly prevent that we'd need to make the trash into some sort of special purgatory directory with behavior like no other, no? > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false Key: HADOOP-2531 URL: https://issues.apache.org/jira/browse/HADOOP-2531 Project: Hadoop Issue Type: Bug Components: dfs Reporter: Doug Cutting Fix For: 0.16.0 Generic permission checking code should still work correctly when dfs.permissions=false. Currently FileStatus#getPermission() returns the actual permission when dfs.permissions=false on the namenode, which is incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556130#action_12556130 ] Doug Cutting commented on HADOOP-2528: -- > Whether a file is readable/writable also depends on if the user has > searchable permission on all ancestor directories Isn't that already demonstrated by the fact that the file is returned from listStatus()? > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556129#action_12556129 ] Doug Cutting commented on HADOOP-2514: -- > if the home directory does not exist, I am proposing that deletes move to a > common trash area. Or the move-to-trash could fail with an exception in this case. > Also note with the trashbin in /user//.trash, instead of > /trash/ the trashbin compacter will have to look in multiple home > dirs instead of merely in /trash. Why is that bad? It'll have to look in the same number of directories in either case, no? > Unfortunately the client side code may find it expensive to do a rpc > per-subtree-entry when deleting a large subtree. It's only an RPC per directory in the tree, not per file. > Are you suggesting a per-user trashbin compacter running as the user? No. But we might have the emptier thread 'su' to each user as it loops through the trash directories so that the checking is implicit and only performed once. I don't like using 'su'-like stuff much though. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556123#action_12556123 ] Doug Cutting commented on HADOOP-2528: -- Let me be clear: permission checking of mapred inputs may not work very well yet. But it should work in the 0.16 release. It looks like when dfs.permissions=false that the returned file permissions are not all 777. That's perhaps a bug. Either that, or the FileSystem#checkAccess() utility method added by this patch should somehow check whether permissions are enabled. It is better to begin to address such issues sooner than later in this release cycle. If we advertise that file permissions are implemented in this release, then we ought to attempt to make sure that they're usable, no? Checking permissions while checking existence of inputs seems like a reasonable thing to be able to do, should have no new significant performance impact, and causes us to work some of these things out. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556119#action_12556119 ] Doug Cutting commented on HADOOP-2528: -- > What if permissions checking is disabled (trunk currently allows it. it is in > fact the default)? Yes, it is possible to disable HDFS permission checks. But shouldn't generic permission checking code still work? We don't want every bit of code that uses filesystem permissions to first have to check if permission checking is enabled. Rather, generic permission checking code should be a no-op when permission checking is disabled in a particular filesystem implementation. > This looks similar to how DFS used to invoke 'exits(file)' before opening a > file. Again, this patch causes no new HDFS RPC calls to be made. It just checks the new values now returned. You might argue that we should disable all input and output checks, but that should be done in a separate issue. Input and output checking were added since folks preferred to find out sooner when their jobs were destined to fail. Perhaps with splits generated client-side now input checking is less critical. But checking the output directory is probably still of great value. > I don't think client alone can decide if a particular access is allowed. The value of FileStatus.getPermission() is never null. It should either be "777" or the correct value for filesystems that implement permission checking. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.