[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Open) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
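A minimal sketch of what the default behavior described in this issue might look like; the use of the "user.name" system property and of Path#makeQualified() is an assumption for illustration, not the committed patch, and the two method bodies below belong to FileSystem and RawLocalFileSystem respectively.

{code}
// Sketch only -- not the committed HADOOP-2567 patch.

// Default FileSystem implementation: home is "/user/$USER".
public Path getHomeDirectory() {
  return new Path("/user/" + System.getProperty("user.name"))
    .makeQualified(this);
}

// RawLocalFileSystem override: use the platform home directory.
public Path getHomeDirectory() {
  return new Path(System.getProperty("user.home")).makeQualified(this);
}
{code}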
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Open (was: Patch Available) HADOOP-2646 has been added to address the SortValidator issue. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-tests.patch Adding tests of new getHomeDirectory() method. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567-tests.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560120#action_12560120 ] Doug Cutting commented on HADOOP-2638: -- > Seems clumsier than just closing idle connections [ ... ] It also gives you explicit control. If you do need to iterate over a range of keys, then you can wait to release the connection until you've completed the iteration, while a pread-based approach would have to open a new connection per buffer refill or somesuch. As for simplicity, background threads that time stuff out are hairy and easy to get subtly wrong. Folks also don't generally like more background threads running in the client's JVM, since clients should be lean-and-mean. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. > > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2646) SortValidator broken with fully-qualified working directories
[ https://issues.apache.org/jira/browse/HADOOP-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2646: - Attachment: HADOOP-2646.patch This patch is known to fix SortValidator on single-node clusters, but may not work on multi-node clusters. See HADOOP-2567 for details. > SortValidator broken with fully-qualified working directories > - > > Key: HADOOP-2646 > URL: https://issues.apache.org/jira/browse/HADOOP-2646 > Project: Hadoop > Issue Type: Bug > Components: test >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2646.patch > > > The sort validator is broken by HADOOP-2567. In particular, it no longer > works when DistributedFileSystem#getWorkingDirectory() returns a > fully-qualified path. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2646) SortValidator broken with fully-qualified working directories
SortValidator broken with fully-qualified working directories - Key: HADOOP-2646 URL: https://issues.apache.org/jira/browse/HADOOP-2646 Project: Hadoop Issue Type: Bug Components: test Reporter: Doug Cutting Fix For: 0.16.0 The sort validator is broken by HADOOP-2567. In particular, it no longer works when DistributedFileSystem#getWorkingDirectory() returns a fully-qualified path. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560106#action_12560106 ] Doug Cutting commented on HADOOP-2567: -- I am unable to reproduce this failure. The single-machine instructions you gave above generate four input files and one output file. I modified the sort command line so that four output files are used, since the code in question involves determining whether a given input to the validator is a sort input or output, but that still validated correctly. Perhaps Arun, who originally wrote the validator, could have a look at this? > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2634) Deprecate exists() and isDir() to simplify ClientProtocol.
[ https://issues.apache.org/jira/browse/HADOOP-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560101#action_12560101 ] Doug Cutting commented on HADOOP-2634: -- Hairong has addressed these inconsistencies in HADOOP-2566. Yes, the implementation of exists() in terms of getFileStatus() would be simple. However it is considered bad style to use exceptions for normal control flow, and exists() returning false is a normal condition. We might just have to live with that... > Deprecate exists() and isDir() to simplify ClientProtocol. > -- > > Key: HADOOP-2634 > URL: https://issues.apache.org/jira/browse/HADOOP-2634 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.15.0 >Reporter: Konstantin Shvachko > > ClientProtocol can be simplified by removing two methods > {code} > public boolean exists(String src) throws IOException; > public boolean isDir(String src) throws IOException; > {code} > This is a redundant api, which can be implemented in DFSClient as convenience > methods using > {code} > public DFSFileInfo getFileInfo(String src) throws IOException; > {code} > Note that we already deprecated several Filesystem method and advised to use > getFileStatus() instead. > Should we deprecate them in 0.16? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560090#action_12560090 ] Doug Cutting commented on HADOOP-2638: -- > I'd be interested to hear how the aforementioned 'pread' would be better than > whats going on underneath the MapFile.get. In HDFS, each call to pread opens a new connection to a datanode, reads the requested data, then closes the connection. If the requested data spans multiple blocks it will open connections for each block as required, it will re-try on network errors, etc. But, bottom-line, no connection is left open. When you initially open an HDFS file it does not open a connection to any datanodes: only the namenode is consulted on open. Once you call read(byte[]), a datanode connection is generally held open. But, if one only ever uses pread, then no connection is held open. Another approach to fixing this would be to add an FSInputStream method to close the connection to the datanode, perhaps called release(). The stream would still be open and at the same position, but some attached resources may be released. The default implementation would do nothing, but for HDFS it would close any open datanode connection. Then we could add a SequenceFile#release(), and similarly for MapFile. Then, after a call to MapFile#get() you could explicitly release the underlying connection. That might be the simplest fix to implement. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. 
> > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
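To make the proposed release() call concrete, here is a hedged sketch of how an application like HBase might use it after a random read. MapFile.Reader#get() and close() are existing methods; release() is only the addition suggested in the comment above, so it is shown commented out, and the Text value type is assumed for the example.

{code}
import java.io.IOException;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class RandomReadSketch {
  // Perform one random read from an already-open reader, then drop any
  // idle datanode connection while keeping the reader (and its in-memory
  // index) open for later calls.
  public static Writable readOne(MapFile.Reader reader, WritableComparable key)
      throws IOException {
    Writable val = reader.get(key, new Text());
    // Proposed, not-yet-existing API:
    // reader.release();  // close idle datanode connection, keep reader open
    return val;
  }
}
{code}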
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560086#action_12560086 ] Doug Cutting commented on HADOOP-2566: -- Do listStatus(Path[]) and globStatus(Path[]) need to be public? Does anyone use these but the globbing code? I generally prefer not to make something public without a strong need. Other than that, this looks good to me. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch, globStatus1.patch, globStatus2.patch, > globStatus3.patch, globStatus4.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2638) Add close of idle connection to DFSClient and to DataNode DataXceiveServer
[ https://issues.apache.org/jira/browse/HADOOP-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560061#action_12560061 ] Doug Cutting commented on HADOOP-2638: -- Are you suggesting that MapFile#Reader change to use read(pos, buf, off, len), aka pread, exclusively? That's an interesting idea. We could implement this by adding an option to SequenceFile#Reader to always use pread. MapFile would not use this option for its index file, which is always read in its entirety, but only for its data file. It would mean that, should one seek to a key and then do sequential access, that each buffer refill would require a new connection, which would not be optimal. But that could be optimized: a buffer refill triggered by next() could switch the underlying data file to non-pread mode, while the next seek() might convert it back to pread mode. > Add close of idle connection to DFSClient and to DataNode DataXceiveServer > -- > > Key: HADOOP-2638 > URL: https://issues.apache.org/jira/browse/HADOOP-2638 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: stack > > This issue is for adding timeout and shutdown of idle DFSClient <-> DataNode > connections. > Applications can have DFS usage patterns than deviate from that of MR 'norm' > where files are generally opened, sucked down as fast as is possible, and > then closed. For example, at the other extreme, hbase wants to support fast > random reading of key values over a sometimes relatively large set of > MapFiles or MapFile equivalents. To avoid paying startup costs on every > random read -- opening the file and reading in the index each time -- hbase > just keeps all of its MapFiles open all the time. > In an hbase cluster of any significant size, this can add up to lots of file > handles per process: See HADOOP-2577, " [hbase] Scaling: Too many open file > handles to datanodes" for an accounting. > Given how DFSClient and DataXceiveServer interact when random reading, and > given past observations that have the client-side file handles mostly stuck > in CLOSE_WAIT (See HADOOP-2341, 'Datanode active connections never returns to > 0'), a suggestion made up on the list today, that idle connections should be > timedout and closed, would help applications that have hbase-like access > patterns conserve file handles and allow them scale. > Below is context that comes of the mailing list under the subject: 'Re: > Multiplexing sockets in DFSClient/datanodes?' > {code} > stack wrote: > > Doug Cutting wrote: > >> RPC also tears down idle connections, which HDFS does not. I wonder how > >> much doing that alone might help your case? That would probably be much > >> simpler to implement. Both client and server must already handle > >> connection failures, so it shouldn't be too great of a change to have one > >> or both sides actively close things down if they're idle for more than a > >> few seconds. > > > > If we added tear down of idle sockets, that'd work for us and, as you > > suggest, should be easier to do than rewriting the client to use async i/o. > > Currently, random reading, its probably rare that the currently opened > > HDFS block has the wanted offset and so a tear down of the current socket > > and an open of a new one is being done anyways. > HADOOP-2346 helps with the Datanode side of the problem. We still need > DFSClient to clean up idle connections (otherwise these sockets will stay in > CLOSE_WAIT state on the client). 
This would require an extra thread on client > to clean up these connections. You could file a jira for it. > Raghu. > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
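For reference, the 'pread' under discussion is the positional-read call shown below; per the description earlier in this thread, HDFS opens a datanode connection, reads, and closes it for each such call, so nothing is left open between calls. A small hedged illustration:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
  // Read buf.length bytes at an arbitrary offset using the positional
  // read API; unlike seek() followed by read(byte[]), this does not
  // leave a datanode connection held open by the stream.
  public static int readAt(FileSystem fs, Path file, long pos, byte[] buf)
      throws IOException {
    FSDataInputStream in = fs.open(file);
    try {
      return in.read(pos, buf, 0, buf.length);
    } finally {
      in.close();
    }
  }
}
{code}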
[jira] Commented: (HADOOP-2634) Deprecate exists() and isDir() to simplify ClientProtocol.
[ https://issues.apache.org/jira/browse/HADOOP-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560052#action_12560052 ] Doug Cutting commented on HADOOP-2634: -- +1 for removing those protocol methods. FileSystem#exists() should probably be made a concrete method in FileSystem.java, defined in terms of getFileStatus(), most existing implementations can probably be removed, and it could probably be deprecated. BTW, what is getFileStatus() supposed to do when a file does not exist? Throw an IOException or return null? The former is generally preferable, but the latter makes implementing exists() easier, since we should not use exception handling for normal program flow. I don't see a need to do this the day before 0.16 feature freeze, and it could be destabilizing. > Deprecate exists() and isDir() to simplify ClientProtocol. > -- > > Key: HADOOP-2634 > URL: https://issues.apache.org/jira/browse/HADOOP-2634 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.15.0 >Reporter: Konstantin Shvachko > > ClientProtocol can be simplified by removing two methods > {code} > public boolean exists(String src) throws IOException; > public boolean isDir(String src) throws IOException; > {code} > This is a redundant api, which can be implemented in DFSClient as convenience > methods using > {code} > public DFSFileInfo getFileInfo(String src) throws IOException; > {code} > Note that we already deprecated several Filesystem method and advised to use > getFileStatus() instead. > Should we deprecate them in 0.16? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
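A minimal sketch of exists() defined in terms of getFileStatus(), assuming the missing-file case surfaces as a FileNotFoundException; this is exactly the exception-for-normal-control-flow pattern the comment is weighing, not a committed implementation, and it is written as a static helper rather than a FileSystem method for self-containment.

{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExistsSketch {
  // exists() as a convenience over getFileStatus(): a missing path is
  // assumed to raise FileNotFoundException rather than return null.
  public static boolean exists(FileSystem fs, Path f) throws IOException {
    try {
      fs.getFileStatus(f);
      return true;
    } catch (FileNotFoundException e) {
      return false;
    }
  }
}
{code}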
[jira] Commented: (HADOOP-2626) RawLocalFileStatus is badly handling URIs
[ https://issues.apache.org/jira/browse/HADOOP-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560043#action_12560043 ] Doug Cutting commented on HADOOP-2626: -- > What about this patch then ? That looks better to me, in that the returned Path is now fully qualified. Does it handle escapes any better than before? If not, 'new Path(file.toUri().getPath()).makeQualified(fs)' may do better. As Nigel indicates, some test cases would be very useful. > RawLocalFileStatus is badly handling URIs > - > > Key: HADOOP-2626 > URL: https://issues.apache.org/jira/browse/HADOOP-2626 > Project: Hadoop > Issue Type: Bug > Components: fs >Affects Versions: 0.15.2 >Reporter: Frédéric Bertin > Attachments: HADOOP-2626.patch > > > as a result, files with special characters (that get encoded when translated > to URIs) are badly handled using a local filesystem. > {{new Path(f.toURI().toString()))}} should be replaced by {{new > Path(f.toURI().getPath()))}} > IMHO, each call to {{toURI().toString()}} should be considered suspicious. > There's another one in the class CopyFiles at line 641. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
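A hedged sketch of the conversion being suggested, written as the hypothetical fileToPath helper mentioned in this thread (it is not an existing method): decode the file: URI so escaped characters come back out, then qualify the result against the filesystem.

{code}
import java.io.File;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileToPathSketch {
  // Hypothetical helper: java.io.File -> fully-qualified Path.
  // Going through toURI().getPath() decodes percent-escapes, and
  // makeQualified() pins the result to the given FileSystem's scheme.
  public static Path fileToPath(File file, FileSystem fs) {
    return new Path(file.toURI().getPath()).makeQualified(fs);
  }
}
{code}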
[jira] Commented: (HADOOP-2421) Release JDiff report of changes between different versions of Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560038#action_12560038 ] Doug Cutting commented on HADOOP-2421: -- > where do i get the "OLD" one ? The approach I suggested in the Lucene issue was to have some ant properties that determine the subversion tag url of the prior version. In trunk this would point to the prior release. We'd update it in trunk after each release is made. Then the ant build script would check this out in build/ if it didn't already exist there. We could (and should) permit this to be optimized, perhaps by permitting folks to override a property so that the prior version is stored somewhere more permanent than build/, and perhaps use 'svn switch; svn update' to make sure that the cached prior version contains what we expect. > Does the user need to download and then pass the path to ant with -D option ? I'd imagined that specifying -Djdiff.prior.dir would be optional, but would help performance a lot, but we could make it mandatory, and emit an error if it's not specified. That might reduce the load on subversion somewhat. > Release JDiff report of changes between different versions of Hadoop > > > Key: HADOOP-2421 > URL: https://issues.apache.org/jira/browse/HADOOP-2421 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Nigel Daley >Priority: Minor > > Similar to LUCENE-1083, it would be useful to report javadoc differences (ala > [JDiff|http://www.jdiff.org/]) between Hadoop releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2608) Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
[ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560033#action_12560033 ] Doug Cutting commented on HADOOP-2608: -- We might also look to see whether org.apache.hadoop.record.Utils.fromBinaryString could be made any faster. What happens if this just does 'new String(bytes, "UTF-8")'? Is the problem our homegrown UTF-8 decoder, or UTF-8 decoding in general? It'd be nice to return org.apache.hadoop.io.Text instead, since that permits many string operations w/o decoding UTF-8, but that'd be a bigger change. > Reading sequence file consumes 100% cpu with maximum throughput being about > 5MB/sec per process > --- > > Key: HADOOP-2608 > URL: https://issues.apache.org/jira/browse/HADOOP-2608 > Project: Hadoop > Issue Type: Improvement > Components: io >Reporter: Runping Qi > > I did some tests on the throughput of scanning block-compressed sequence > files. > The sustained throughput was bounded at 5MB/sec per process, with the cpu of > each process maxed at 100%. > It seems to me that the cpu consumption is too high and the throughput is too > low for just scanning files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
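A hedged illustration of the comparison being asked about: decoding the bytes eagerly with the stock JDK decoder versus wrapping them in a Text, which keeps the raw UTF-8 bytes and defers character decoding until it is actually needed.

{code}
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

public class DecodeSketch {
  // Eager decode with the JDK's built-in UTF-8 decoder.
  public static String jdkDecode(byte[] utf8) throws UnsupportedEncodingException {
    return new String(utf8, "UTF-8");
  }

  // Wrap the bytes in a Text; no char decoding happens until needed.
  public static Text wrap(byte[] utf8) {
    Text t = new Text();
    t.set(utf8, 0, utf8.length);
    return t;
  }
}
{code}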
[jira] Updated: (HADOOP-2626) RawLocalFileStatus is badly handling URIs
[ https://issues.apache.org/jira/browse/HADOOP-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2626: - Status: Open (was: Patch Available) This patch puts an unqualified path in the returned FileStatus. That's not strictly a bug, but we've found that it's always safest to return fully-qualified paths whenever we can. To convert the java.io.File to a Path, we might use new Path(file.getPath()).makeQualified(fs). Perhaps this should be added as a fileToPath method, since there's already a path2File() method. > RawLocalFileStatus is badly handling URIs > - > > Key: HADOOP-2626 > URL: https://issues.apache.org/jira/browse/HADOOP-2626 > Project: Hadoop > Issue Type: Bug > Components: fs >Affects Versions: 0.15.2 >Reporter: Frédéric Bertin > Attachments: patch-Hadoop-2626.diff > > > as a result, files with special characters (that get encoded when translated > to URIs) are badly handled using a local filesystem. > {{new Path(f.toURI().toString()))}} should be replaced by {{new > Path(f.toURI().getPath()))}} > IMHO, each call to {{toURI().toString()}} should be considered suspicious. > There's another one in the class CopyFiles at line 641. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559723#action_12559723 ] Doug Cutting commented on HADOOP-2566: -- > Is this what we wanted? I thought we wanted other way around. I don't think it does that in all cases, but it does still appear to call getStatus() in places. I've not yet examined the logic to see if that's easily avoidable or not. But it's not a fatal problem at this point. For this release the important thing is to have globStatus() as the preferred, non-deprecated method. Once we remove the status cache, during 0.17 development, we'll soon find out whether the globStatus() implementation needs more work to perform well without a cache, and fix that before 0.17 is released. But that aspect shouldn't block this for 0.16, since we still have the cache in 0.16. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch, globStatus1.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2604) [hbase] Create an HBase-specific MapFile implementation
[ https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559676#action_12559676 ] Doug Cutting commented on HADOOP-2604: -- > it'd be nice to iterate on the keys of a MapFile without actually reading the > data SequenceFile supports that, so it shouldn't be too hard to add a next(WritableComparable) method to the MapFile API, right? > [hbase] Create an HBase-specific MapFile implementation > --- > > Key: HADOOP-2604 > URL: https://issues.apache.org/jira/browse/HADOOP-2604 > Project: Hadoop > Issue Type: Improvement > Components: contrib/hbase >Reporter: Bryan Duxbury >Priority: Minor > > Today, HBase uses the Hadoop MapFile class to store data persistently to > disk. This is convenient, as it's already done (and maintained by other > people :). However, it's beginning to look like there might be possible > performance benefits to be had from doing an HBase-specific implementation of > MapFile that incorporated some precise features. > This issue should serve as a place to track discussion about what features > might be included in such an implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
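The SequenceFile support referred to here is the key-only next() call; a MapFile.Reader#next(WritableComparable) would presumably wrap the same call on the data file. A hedged sketch, assuming Text keys for the example:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class KeyScanSketch {
  // Iterate over keys only; next(key) advances past each value without
  // deserializing it.
  public static void printKeys(FileSystem fs, Path file, Configuration conf)
      throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      Text key = new Text();
      while (reader.next(key)) {
        System.out.println(key);
      }
    } finally {
      reader.close();
    }
  }
}
{code}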
[jira] Commented: (HADOOP-2604) [hbase] Create an HBase-specific MapFile implementation
[ https://issues.apache.org/jira/browse/HADOOP-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559655#action_12559655 ] Doug Cutting commented on HADOOP-2604: -- > Exclude column family name from the file [ ... ] The column family name could be stored in the SequenceFile's metadata, no? MapFile's constructors don't currently permit one to specify metadata, but that'd be easy to add. > There is some indication that the existing MapFile implementation is > optimized for streaming access [ ... ] It shouldn't be. The problem is that mapreduce, which is primarily used to benchmark and debug Hadoop, doesn't do any random access. So it's easy for random-access-related performance problems to sneak into MapFile and HDFS. Both Nutch and HBase depend on efficient random access from Hadoop, primarily through MapFile. We need a good random-access benchmark that someone regularly executes. Perhaps one could be added to the sort benchmark suite, since that is regularly run by Yahoo!? Or someone else could start running regular HBase benchmarks on a grid somewhere? > [hbase] Create an HBase-specific MapFile implementation > --- > > Key: HADOOP-2604 > URL: https://issues.apache.org/jira/browse/HADOOP-2604 > Project: Hadoop > Issue Type: Improvement > Components: contrib/hbase >Reporter: Bryan Duxbury >Priority: Minor > > Today, HBase uses the Hadoop MapFile class to store data persistently to > disk. This is convenient, as it's already done (and maintained by other > people :). However, it's beginning to look like there might be possible > performance benefits to be had from doing an HBase-specific implementation of > MapFile that incorporated some precise features. > This issue should serve as a place to track discussion about what features > might be included in such an implementation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
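A hedged sketch of the metadata idea: stash the column family name in SequenceFile metadata at write time and read it back without scanning data. The "hbase.column.family" key is made up for illustration, and the createWriter overload that accepts a Metadata argument is assumed; MapFile itself would still need a constructor that forwards the Metadata object.

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class FamilyMetadataSketch {
  // Record the column family name as file-level metadata.
  public static void writeWithFamily(FileSystem fs, Configuration conf,
                                     Path file, String family) throws IOException {
    SequenceFile.Metadata meta = new SequenceFile.Metadata();
    meta.set(new Text("hbase.column.family"), new Text(family)); // illustrative key
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, file,
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK,
        new DefaultCodec(), null, meta);
    writer.close();
  }

  // Recover the column family name without reading any records.
  public static Text readFamily(FileSystem fs, Configuration conf, Path file)
      throws IOException {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      return reader.getMetadata().get(new Text("hbase.column.family"));
    } finally {
      reader.close();
    }
  }
}
{code}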
[jira] Commented: (HADOOP-2421) Release JDiff report of changes between different versions of Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559629#action_12559629 ] Doug Cutting commented on HADOOP-2421: -- > Is there a URI where the old documentation is available ? There are public URIs where released javadocs may be obtained, but I don't think JDiff uses the normal javadoc, but rather special javadoc output that it generates. Please see LUCENE-1083 which addresses this further. > Release JDiff report of changes between different versions of Hadoop > > > Key: HADOOP-2421 > URL: https://issues.apache.org/jira/browse/HADOOP-2421 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Nigel Daley >Priority: Minor > > Similar to LUCENE-1083, it would be useful to report javadoc differences (ala > [JDiff|http://www.jdiff.org/]) between Hadoop releases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559628#action_12559628 ] Doug Cutting commented on HADOOP-2567: -- > Seems that the tests are still failing. Earlier I tried on a single machine > and it worked. Did you restart the cluster running the patched code? That may be required. Did it fail with the same error? > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559299#action_12559299 ] Doug Cutting commented on HADOOP-2566: -- A few comments: - should stat2paths be a public method on FileSystem? I'd prefer it were either private or perhaps on FileUtil. - globPaths() isn't deprecated. Do we think we'll keep this, or should it be deprecated? It is handy in some cases, but, on the other hand, we'd like to force folks to examine their uses of it, since in most cases performance will become abysmal once the FileStatus cache is removed, and we don't want to surprise folks with that. Thoughts? > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: globStatus.patch > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Reopened) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-sortvalidate.patch The attached patch fixes the sort validator. Amar, can you please confirm that this fixes things for you? Thanks! > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567-sortvalidate.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559284#action_12559284 ] Doug Cutting commented on HADOOP-2566: -- > Should a user of globStatus() be able to distinguish between a non-existent > path and a glob that does not match any files? I'm not sure I completely understand the distinction. In one case are you passing a path without any meta characters but that does not exist, and in the other one with metacharacters but that matches no files? In any case it should probably handle this the same way globPaths() does. If the distinction is important then perhaps the non-existing file case should return null, while the non-matching expression case should return an empty array. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
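A hedged illustration of the convention proposed here, using the globStatus() method under discussion; this is a suggestion from the comment, not necessarily what the final patch implements: null for a literal path that does not exist, an empty array for a pattern that matches nothing.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobCheckSketch {
  public static String describe(FileSystem fs, Path pattern) throws IOException {
    FileStatus[] matches = fs.globStatus(pattern);   // method under discussion
    if (matches == null) {
      return pattern + " does not exist";            // literal path, no file
    } else if (matches.length == 0) {
      return pattern + " matched no files";          // pattern matched nothing
    }
    return pattern + " matched " + matches.length + " file(s)";
  }
}
{code}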
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559216#action_12559216 ] Doug Cutting commented on HADOOP-2514: -- > I just committed this. Oops. I should have had someone review this first. Could someone please review this now? Should I revert it until someone does? > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559165#action_12559165 ] Doug Cutting commented on HADOOP-2543: -- > setting x77 means that there is a potential window where missed files can be > co-opted by someone who shouldn't have them. Like all files are today? I don't follow. We currently have zero security. The security we're adding in this release is easy to subvert and mostly to keep folks from shooting themselves in the foot. Keeping the "window" that's wide open today open a bit longer doesn't significantly compromise anything. > all those requiring backwards compatibility should just keep perms turned off. We want folks to be able to upgrade, then use new features, without jumping through hoops. Hoops should be optional. If you wish to be able to configure a non-777 permission for after upgrade, that would be a reasonable feature, but 777 should be the default. So perhaps we need a dfs.initial.permission parameter, used by the upgrade, whose default value is 777, but that you can override along with setting dfs.permissions=false, to support the upgrade procedure you desire. But I don't think we should force all installations through that procedure in order to get a usable system. We know from experience that most folks just install the new version and expect things to work out of the box. When they don't they file bugs. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned off. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2431) Test HDFS File Permissions
[ https://issues.apache.org/jira/browse/HADOOP-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559146#action_12559146 ] Doug Cutting commented on HADOOP-2431: -- Perhaps if the exception named in the RemoteException is a class that's loaded on the client and is permitted by the method signature, then RPC should automatically try to construct an instance and throw it. But that's not what RPC does today. If you feel it should do this, please file a separate issue. The FileSystem API promises that applications which attempt to violate permissions will be thrown an AccessControlException. Today, until RPC is changed, we must intercept RemoteException and explicitly throw an AccessControlException. The fact that a particular FileSystem is implemented using RPC should be invisible to clients. > Test HDFS File Permissions > -- > > Key: HADOOP-2431 > URL: https://issues.apache.org/jira/browse/HADOOP-2431 > Project: Hadoop > Issue Type: Test > Components: test >Affects Versions: 0.15.1 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: HDFSPermissionSpecification6.pdf, > PermissionsTestPlan1.pdf, testDFSPermission.patch, testDFSPermission1.patch > > > This jira is intended to provide junit tests to HADOOP-1298. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
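A hedged sketch of the interception described above: client-side code catches the RemoteException and re-throws a real AccessControlException when that is the class the server named; the actual DFSClient change may look different.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.permission.AccessControlException;
import org.apache.hadoop.ipc.RemoteException;

public class UnwrapSketch {
  // If the server-side class named in the RemoteException is
  // AccessControlException, surface that exception directly to the caller.
  public static IOException unwrap(RemoteException re) {
    if (AccessControlException.class.getName().equals(re.getClassName())) {
      return new AccessControlException(re.getMessage());
    }
    return re;
  }
}
{code}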
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559140#action_12559140 ] Doug Cutting commented on HADOOP-2385: -- > I would prefer creating new classes solely dedicated to configuration logic [ > ... ] I think this varies, case-by-case. For a complex subsystem like HDFS, it may make sense to have dedicated configuration classes. For a standalone class, like an InputFormat or compression codec, it probably makes sense to put configuration accessors directly on the class in question. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558879#action_12558879 ] Doug Cutting commented on HADOOP-2543: -- > Explicitly tightening them is more backwards compatible, but from the > security point of view, explicitly loosening them is safer. Yes, and for this upgrade, back-compatibility is more important than immediately increasing security. We don't decrease security any, and folks can easily increase security after the upgrade by tightening permissions. But we don't want things to be broken as soon as they upgrade by automatically tightening permissions. What I'm proposing is essentially the use-case you describe above for using dfs.permission=false, but without setting that: after the upgrade everything is permitted, and folks can start restricting access, but without having to restart the cluster. I think for most sites this is simpler, less surprising and sufficient. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned off. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2431) Test HDFS File Permissions
[ https://issues.apache.org/jira/browse/HADOOP-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558874#action_12558874 ] Doug Cutting commented on HADOOP-2431: -- This tests that permission check failures throw a RemoteException wrapping an AccessControlException. Shouldn't permission check failures throw an AccessControlException directly? DistributedFileSystem or DFSClient should catch the RemoteException and, when it wraps an AccessControlException, throw one of those so that client code sees that, no? Should I file a separate issue for this? > Test HDFS File Permissions > -- > > Key: HADOOP-2431 > URL: https://issues.apache.org/jira/browse/HADOOP-2431 > Project: Hadoop > Issue Type: Test > Components: test >Affects Versions: 0.15.1 >Reporter: Hairong Kuang >Assignee: Hairong Kuang > Fix For: 0.16.0 > > Attachments: HDFSPermissionSpecification6.pdf, > PermissionsTestPlan1.pdf, testDFSPermission.patch, testDFSPermission1.patch > > > This jira is intended to provide junit tests to HADOOP-1298. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Status: Patch Available (was: Open) > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2514: - Attachment: HADOOP-2514.patch Here's a patch that implements Sanjay's option 2. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2514.patch > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558782#action_12558782 ] Doug Cutting commented on HADOOP-2567: -- > Currently the trunk does not pass the sort validation tests. Can you please attach details, like a log or stack trace? Or at least instructions on how to reproduce this. Thanks! Also, it might be good to run a scaled-down version of the sort benchmark & validation during unit testing, so that we exercise those codepaths and find things like this sooner. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558770#action_12558770 ] Doug Cutting commented on HADOOP-2543: -- > 1) all files and directories will be owned by the super user and super group; That seems fine. > 2) the permission of the files is set to be 0600 and the permission of the > directories is set to be 0700. The use of dfs.permissions=false should be optional, no? Folks should be able to upgrade and use the filesystem as before, but this would break that. The default protection after upgrade should continue to be 777, and folks should need to explicitly tighten permissions rather than explicitly loosen them. > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned on. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2543) No-permission-checking mode for smooth transition to 0.16's permissions features.
[ https://issues.apache.org/jira/browse/HADOOP-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558745#action_12558745 ] Doug Cutting commented on HADOOP-2543: -- How does this differ from the way that dfs.permissions=false works already? Is this a documentation issue, or are there functional changes required? > No-permission-checking mode for smooth transition to 0.16's permissions > features. > -- > > Key: HADOOP-2543 > URL: https://issues.apache.org/jira/browse/HADOOP-2543 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.15.1 >Reporter: Sanjay Radia >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > In moving to 0.16, which will support permissions, a mode of no-permission > checking has been proposed to allow smooth transition to using the new > permissions feature. > The idea is that at first 0.16 will be used for a period of time with > permission checking off. > Later after the admin has changed ownership and permissions of various files, > the permission checking can be turned on. > This Jira defines what the semantics are of the no-permission-checking mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558742#action_12558742 ] Doug Cutting commented on HADOOP-2566: -- > For example, globPath("/user/*/data") needs only to listPath("/user"). But listPaths() is not a primitive, it is a utility method defined in terms of listStatus(). So this example is calling listStatus("/user") and then stripping the list of FileStatus objects down to a list of Path objects. We should remove that stripping, or at least make it optional. To make it optional, the primitive glob operation should be globStatus, and globPaths() should become a utility method defined in terms of globStatus(). > Some of shell commands like delete, copy, and rename use globPath but don't > need FileStatus. These actually all do need the FileStatus. They need to find out whether each file is a directory or not, to find out when to recurse. Copy also needs other attributes so that they can be set on the copy too. So we'll end up needing to rework these. We will not remove globPaths() in this release, so these commands do not need to change right now. But before we can remove the cache we need to examine every place that calls globPaths to check whether these must be converted to use globStatus. That's why we're deprecating globPaths(), to force folks to do this. Then, in 0.17, we can remove the cache from trunk, and start identifying all the problems. But we want users who upgrade to 0.17 to be forwarned, and to have an API that supports cache-free use before we remove the cache, so that they can upgrade to 0.17 more smoothly. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
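A minimal sketch of the layering proposed in the comment above, assuming a FileSystem#globStatus(Path) primitive exists; this is illustrative only, not the committed patch:
{code}
// Inside FileSystem (hypothetical): keep globPaths() only as a deprecated
// utility defined in terms of the globStatus() primitive.
/** @deprecated use globStatus(Path) instead */
public Path[] globPaths(Path pattern) throws IOException {
  FileStatus[] matches = globStatus(pattern);   // primitive: returns status
  Path[] result = new Path[matches.length];
  for (int i = 0; i < matches.length; i++) {
    result[i] = matches[i].getPath();           // strip the status back down to paths
  }
  return result;
}
{code}
Callers that really only need paths lose nothing, while callers that need status stop paying for a per-file getFileStatus() once the cache is gone.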
[jira] Commented: (HADOOP-2346) DataNode should have timeout on socket writes.
[ https://issues.apache.org/jira/browse/HADOOP-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558709#action_12558709 ] Doug Cutting commented on HADOOP-2346: -- This looks nice! How well does it work? SocketInputStream and SocketOutputStream seem like fine names, but should they be nested classes in IOUtils, or perhaps independent classes in the 'net' package? Also, we might make the error messages in the exceptions a bit more informative, e.g., including the address the socket is connected to, the timeout, etc. > DataNode should have timeout on socket writes. > -- > > Key: HADOOP-2346 > URL: https://issues.apache.org/jira/browse/HADOOP-2346 > Project: Hadoop > Issue Type: Bug > Components: dfs >Affects Versions: 0.15.1 >Reporter: Raghu Angadi >Assignee: Raghu Angadi > Attachments: HADOOP-2346.patch > > > If a client opens a file and stops reading in the middle, DataNode thread > writing the data could be stuck forever. For DataNode sockets we set read > timeout but not write timeout. I think we should add a write(data, timeout) > method in IOUtils that assumes it the underlying FileChannel is non-blocking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
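For readers unfamiliar with the approach under review, here is a rough, hypothetical sketch of a timed socket write built on a non-blocking channel plus a selector. It only illustrates the idea (including putting the remote address and the timeout into the error message, as suggested above) and is not the attached patch:
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public final class TimedWrites {
  /** Write buf to a channel that is already configured as non-blocking,
   *  failing if no progress can be made within timeoutMs. */
  public static void write(SocketChannel channel, ByteBuffer buf, long timeoutMs)
      throws IOException {
    Selector selector = Selector.open();
    try {
      channel.register(selector, SelectionKey.OP_WRITE);
      while (buf.hasRemaining()) {
        if (selector.select(timeoutMs) == 0) {
          // Informative message: include the peer address and the timeout used.
          throw new IOException("write timed out after " + timeoutMs + " ms to "
              + channel.socket().getRemoteSocketAddress());
        }
        selector.selectedKeys().clear();
        channel.write(buf);   // writes whatever the send buffer will accept
      }
    } finally {
      selector.close();
    }
  }
}
{code}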
[jira] Commented: (HADOOP-2066) filenames with ':' colon throws java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/HADOOP-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558701#action_12558701 ] Doug Cutting commented on HADOOP-2066: -- > I'm not at all sure it makes sense to define what is a valid filename based > on a URI library. The URI standard provides a good interchange syntax for file names. But we shouldn't let it limit what names are possible in various filesystems as we do today: we should support their full range, using escapes where necessary. Unfortunately, with the current API, we can't tell when a character needs to be escaped or when it is intended as a URI meta-character. The problem is that we construct paths in FileSystem independent code, so we don't know how to escape things. Perhaps the solution is to remove the public Path constructor and force all Paths to be created by a FileSystem#createPath method, so that they can be escaped appropriately. Thus, when running on Windows, if one passes a string with unescaped backslashes to LocalFileSystem#createPath(), the backslashes would be interpreted as directory separators, while on Linux or HDFS they'd be treated as literals. Unescaped slashes in a Path URI will always be directory separators, since that's the URI standard we're using for interchange. > filenames with ':' colon throws java.lang.IllegalArgumentException > -- > > Key: HADOOP-2066 > URL: https://issues.apache.org/jira/browse/HADOOP-2066 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: lohit vijayarenu > Attachments: 2066_20071022.patch, HADOOP-2066.patch > > > File names containing colon ":" throws java.lang.IllegalArgumentException > while LINUX file system supports it. > $ hadoop dfs -put ./testfile-2007-09-24-03:00:00.gz filenametest > Exception in thread "main" java.lang.IllegalArgumentException: > java.net.URISyntaxException: Relative path in absolute > URI: testfile-2007-09-24-03:00:00.gz > at org.apache.hadoop.fs.Path.initialize(Path.java:140) > at org.apache.hadoop.fs.Path.(Path.java:126) > at org.apache.hadoop.fs.Path.(Path.java:50) > at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:273) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:117) > at > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:776) > at > org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:757) > at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:116) > at org.apache.hadoop.fs.FsShell.run(FsShell.java:1229) > at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:187) > at org.apache.hadoop.fs.FsShell.main(FsShell.java:1342) > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > testfile-2007-09-24-03:00:00.gz > at java.net.URI.checkPath(URI.java:1787) > at java.net.URI.(URI.java:735) > at org.apache.hadoop.fs.Path.initialize(Path.java:137) > ... 10 more > Path(String pathString) when given a filename which contains ':' treats it as > URI and selects anything before ':' as > scheme, which in this case is clearly not a valid scheme. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
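The ambiguity described in the comment above is easy to reproduce with java.net.URI alone; a purely illustrative example (not Hadoop code):
{code}
import java.net.URI;

public class ColonDemo {
  public static void main(String[] args) {
    // To a URI parser, everything before the first ':' looks like a scheme,
    // so a plain file name with an embedded time stamp parses as an opaque URI.
    URI u = URI.create("testfile-2007-09-24-03:00:00.gz");
    System.out.println(u.getScheme());              // testfile-2007-09-24-03
    System.out.println(u.getSchemeSpecificPart());  // 00:00.gz
    // A FileSystem#createPath() style factory, as suggested above, could
    // escape such names per file system instead of rejecting them.
  }
}
{code}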
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558129#action_12558129 ] Doug Cutting commented on HADOOP-2566: -- Globbing is implemented on top of listPaths() which is implemented on top of listStatus(). The primitive globbing API should not throw away that status information. It should keep it so that glob clients which need it do not have to call getStatus() for each file that matches. Currently the cache of FileStatus hides the cost of these getStatus() calls, but that cache will break things once files and their status can change. So we need globStatus() before we can remove the cache. FileInputFormat, for example, uses globPaths() to list files matching the input specification, then it uses getStatus() on each matching path when building splits. This must change to call globStatus() before the cache is removed. Long-term, globPaths() and listPaths() may perhaps still be useful as utility methods implemented in terms of globStatus() and listStatus(), but since most current users of these will be broken performance-wise once the cache is removed, we should deprecate them now to strongly encourage folks to stop using them before that cache is removed, to give fair warning. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
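A sketch of the usage pattern described above once a globStatus() primitive exists; the helper class and method are hypothetical, not the FileInputFormat patch:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One call returns both the matching paths and their FileStatus, so split
// construction needs no per-file getFileStatus() round trip (and no cache).
class GlobStatusExample {
  static void listInputs(FileSystem fs, Path inputPattern) throws IOException {
    FileStatus[] matches = fs.globStatus(inputPattern);
    for (FileStatus status : matches) {
      if (!status.isDir()) {
        // length is available without an extra namenode call per file
        System.out.println(status.getPath() + " " + status.getLen());
      }
    }
  }
}
{code}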
[jira] Commented: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558071#action_12558071 ] Doug Cutting commented on HADOOP-2566: -- No, we need 'FileStatus[] globStatus(Path pattern)' instead of 'Path[] globPaths(Path pattern)'. > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557821#action_12557821 ] Doug Cutting commented on HADOOP-2560: -- > Much simpler to make the late binding decision to bundle them. The algorithm I outlined above could be done incrementally, rather than all up-front: - N is the desired splits/task - build map for the job inputs - when a node asks for a task, pop up to N splits off its list to form a task - if a node has no more splits, pop splits from other nodes - as each split is popped, remove it from other map entries This is essentially the existing algorithm, except that we allocate more than one split per task. In fact, the existing algorithm handles lots of other subtle cases like speculative execution, task failure, etc. So the best way to implement this is probably to use the existing algorithm multiple times per task, etc. Earlier I'd spoke of implementing this up front, when constructing splits. But if it's done this way, then we needn't actually change public APIs or InputFormats. Tasks could simply internally be changed to execute a list of splits rather than a single split. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557782#action_12557782 ] Doug Cutting commented on HADOOP-2510: -- > The JobScheduler should not be part of the MapReduce sub-project. If we can build MapReduce on top of some shared infrastructure, e.g., a JobScheduler, that is independently maintained and used by a larger community than just the mapreduce community, then that might be a good thing. So I'd love to see a proposal that defines a generally useful primitive layer, with examples of multiple, useful systems that can be layered on top of it, including mapreduce. Also, when this is implemented, I would argue that at least one of these other higher-level systems should be implemented too, in addition to mapreduce, to prove the generality of the lower-level system. Things intended to be reusable that are not in fact reused tend not to actually be reusable. Whether this more primitive layer should be a library that we use to build mapreduce daemons, or a service is an interesting question. The latter would better permit a cluster to be shared by mapreduce and non-mapreduce tasks. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. 
> Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. > * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute clust
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557778#action_12557778 ] Doug Cutting commented on HADOOP-2510: -- > the logic required to run multiple MapReduce jobs is different enough from > running a single > MapReduce job that separate daemons would provide a much cleaner > implementation. If it would improve the implementation, then we should better layer the logic. I have no problem with that. But layering the logic within a single address space will yield a more reliable system than distributing it across multiple hosts. It may be less scalable to keep all the logic in a single service, but I have yet to be convinced that the jobtracker is a scalability bottleneck. So, sure, let's clean up the jobtracker with modular decomposition, but I have yet to see how running different modules of the jobtracker on different hosts will improve things. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. > Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. 
State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. > * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute cluster > paradigm) as a way to directly tackle the two main issues illustrated above. > Clearly, we will have to get around the existing problems, especially w.r.t. > scalability and reliability. > The proposal is to re-work Hadoop Map-Reduce to
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557775#action_12557775 ] Doug Cutting commented on HADOOP-2573: -- > The limit could be max(static_limit, number of cores in cluster / # active > jobs) Jinx! > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557767#action_12557767 ] Doug Cutting commented on HADOOP-2573: -- I think a static limit for all jobs would be useful and best to implement first. After some experience with this, we would be better able to address its shortcomings. Possible future extensions might be: - dynamically altering the limit, e.g., limit=max(min.tasks.per.job, numSlots/numJobsOutstanding) -- ramping up the limit slowly, so that a users's sequential jobs don't have all their slots immediately taken when one job completes -- ramping down the limit slowly, so that tasks are given an opportunity to finish normally before they are killed. - incorporating job priority into the limit > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
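A tiny sketch of the dynamic variant mentioned above; the names (min.tasks.per.job, numSlots, numJobsOutstanding) follow the comment and are illustrative only:
{code}
public class TaskLimit {
  // limit = max(min.tasks.per.job, numSlots / numJobsOutstanding)
  static int runningTaskLimit(int numSlots, int numJobsOutstanding, int minTasksPerJob) {
    if (numJobsOutstanding <= 1) {
      return numSlots;   // a lone job may use every slot
    }
    return Math.max(minTasksPerJob, numSlots / numJobsOutstanding);
  }
}
{code}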
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557758#action_12557758 ] Doug Cutting commented on HADOOP-2573: -- Some discussion of this issue may be found at: http://www.nabble.com/question-about-file-glob-in-hadoop-0.15-tt14702242.html#a14741794 > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2574) bugs in mapred tutorial
bugs in mapred tutorial --- Key: HADOOP-2574 URL: https://issues.apache.org/jira/browse/HADOOP-2574 Project: Hadoop Issue Type: Bug Components: documentation Reporter: Doug Cutting Fix For: 0.15.3, 0.16.0 Sam Pullara sends me: {noformat} Phu was going through the WordCount example... lines 52 and 53 should have args[0] and args[1]: http://lucene.apache.org/hadoop/docs/current/mapred_tutorial.html The javac and jar command are also wrong, they don't include the directories for the packages, should be: $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d classes WordCount.java $ jar -cvf /usr/joe/wordcount.jar WordCount.class -C classes . {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2573) limit running tasks per job
[ https://issues.apache.org/jira/browse/HADOOP-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557724#action_12557724 ] Doug Cutting commented on HADOOP-2573: -- This addresses issues raised in HADOOP-2510. > limit running tasks per job > --- > > Key: HADOOP-2573 > URL: https://issues.apache.org/jira/browse/HADOOP-2573 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > Fix For: 0.17.0 > > > It should be possible to specify a limit to the number of tasks per job > permitted to run simultaneously. If, for example, you have a cluster of 50 > nodes, with 100 map task slots and 100 reduce task slots, and the configured > limit is 25 simultaneous tasks/job, then four or more jobs will be able to > run at a time. This will permit short jobs to pass longer-running jobs. > This also avoids some problems we've seen with HOD, where nodes are > underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557722#action_12557722 ] Doug Cutting commented on HADOOP-2510: -- I added HADOOP-2573 for the approach I propose above. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduce) helps significantly here. > > Primarily, the notion of *job-level scheduling* leading to private clusers, > as opposed to *task-level scheduling*, is a good peg to hang-on the majority > of the blame. > Keeping the above factors in mind, here are some thoughts on how to > re-structure Hadoop Map-Reduce to solve some of these issues. > > h3. State of the Art > As it exists today, a large, static, Hadoop Map-Reduce cluster (forget HoD > for a bit) does provide task-level scheduling; however as it exists today, > it's scalability to tens-of-thousands of user-jobs, per-week, is in question. > Lets review it's current architecture and main components: > * JobTracker: It does both *task-scheduling* and *task-monitoring* > (tasktrackers send task-statuses via periodic heartbeats), which implies it > is fairly loaded. It is also a _single-point of failure_ in the Map-Reduce > framework i.e. its failure implies that all the jobs in the system fail. This > means a static, large Map-Reduce cluster is fairly susceptible and a definite > suspect. Clearly HoD solves this by having per-job clusters, albeit with the > above drawbacks. 
> * TaskTracker: The slave in the system which executes one task at-a-time > under directions from the JobTracker. > * JobClient: The per-job client which just submits the job and polls the > JobTracker for status. > > h3. Proposal - Map-Reduce 2.0 > The primary idea is to move to task-level scheduling and static Map-Reduce > clusters (so as to maintain the same storage cluster and compute cluster > paradigm) as a way to directly tackle the two main issues illustrated above. > Clearly, we will have to get around the existing problems, especially w.r.t. > scalability and reliability. > The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, > static cluster. > Here is an overview of how its main components would look like: > * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. > Lets call this the *JobScheduler* henceforth. Clearly (data-locality aware) > Maui/Moab are candidates for being the scheduler, in which case, the > JobScheduler is just a thin wrapper around them. > * TaskTracker: These stay as before, without some minor changes as > illustrated later in the piece. > * JobClient: Fatten up the JobClient my putting a lot more intelligence into > it. Enhance it to talk to the JobTracker to ask for available TaskTrackers > and then contact them to schedule and m
[jira] Created: (HADOOP-2573) limit running tasks per job
limit running tasks per job --- Key: HADOOP-2573 URL: https://issues.apache.org/jira/browse/HADOOP-2573 Project: Hadoop Issue Type: New Feature Components: mapred Reporter: Doug Cutting Fix For: 0.17.0 It should be possible to specify a limit to the number of tasks per job permitted to run simultaneously. If, for example, you have a cluster of 50 nodes, with 100 map task slots and 100 reduce task slots, and the configured limit is 25 simultaneous tasks/job, then four or more jobs will be able to run at a time. This will permit short jobs to pass longer-running jobs. This also avoids some problems we've seen with HOD, where nodes are underutilized in their tail, and it should permit improved input locality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557714#action_12557714 ] Doug Cutting commented on HADOOP-2567: -- > Would it be make sense to use UserGroupInformation to determine the home dir? Yes, someday. Long-term, username's should be filesystem-specific. But we don't yet have an API to get the username for a particular filesystem. Once that's added, it should be returned as a UserGroupInformation and used to determine the home directory, but until then, I think this is not worth adding. Note that this patch does not change how the home directory in HDFS is computed, it only adds a method to expose the home directory already implicit in HDFS. Changing how we compute it should perhaps be the subject of another issue. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: 2567-3.patch, HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2298) ant target without source and docs
[ https://issues.apache.org/jira/browse/HADOOP-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557709#action_12557709 ] Doug Cutting commented on HADOOP-2298: -- > No one has mentioned any specific name for the target and "minimal" tarfile. I think such things are typically called "binary" or "bin" distributions, no? > ant target without source and docs > --- > > Key: HADOOP-2298 > URL: https://issues.apache.org/jira/browse/HADOOP-2298 > Project: Hadoop > Issue Type: Improvement > Components: build >Reporter: Gautam Kowshik > Attachments: 2298.patch.1 > > > Can we have an ant target or a -D option to build the hadoop tar without the > source and documentation? This brings down the tar size from 11.5 MB to 5.6 > MB. This would speed up distribution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557481#action_12557481 ] Doug Cutting commented on HADOOP-2560: -- > The current system takes full advantage of mapping jobs to nodes dynamically. Currently we compute and cache the mapping once per job, and then base all subsequent decisions on that cache. We get ~99% job locality with that 'static' information. Things should be about about the same if we group things, unless I'm missing something. > One could perhaps do something like what you suggest dynamically in the JT > when a TT requests a new job. That's a possible enhancement. I'm not sure it's required for good localization, and it would add significant load to the namenode. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557478#action_12557478 ] Doug Cutting commented on HADOOP-2385: -- > This means that the configuration classes should be public then, right? Yes, if the parameters they access should be publicly accessible. One might argue that certain parameters are only consumed internally and don't need public accessors, but, more typically, parameter accessors are on public classes. > And it doesn't matter where the get/setters are. > Particularly we can combine all of them in one class > or even place them in the Configuration class. Is it what you want? They shouldn't be all in one place or all in Configuration for the same reason that we don't put everything in a single file: we should attempt to keep related things together, to localize changes. So an HDFS-specific parameter accessor should be on an HDFS-specific class. How fine-grained we localize isn't clear. Generally, finer is better: find the most-specific public class that encompasses the use and add the accessor there. So if something's only used in the Datanode, but used in a few different classes there, then it might best be on Datanode. > What I meant is that we keep placing logically independent > code inside e.g. FSNamesystem, which makes it bigger, while it could easily > be made a separate class. > And configuration is just an example of such logically independent part. If configuration stuff is not specific to FSNamesystem (i.e., logically independent) then it shouldn't go there. If it is specific to FSNamesystem then it could go there, or perhaps on a new class that's used only by FSNamesystem, e.g., FSNamesystemParams. If it's used equally by FSNamesystem and other classes then it could either go on an existing shared class (e.g., Namenode) or a new shared class (NamenodeParams). > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Resolution: Fixed Status: Resolved (was: Patch Available) I just committed this. > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
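For a site that wants the old behaviour back, the opt-out discussed here is a single setting; a minimal illustration using the key named in this issue (normally it would be set in hadoop-site.xml rather than programmatically):
{code}
import org.apache.hadoop.conf.Configuration;

public class DisablePermissionChecking {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Optional, non-default setting: turn enforcement back off.
    // HADOOP-2543 defines the exact semantics of this mode.
    conf.set("dfs.permissions", "false");
    System.out.println("dfs.permissions = " + conf.get("dfs.permissions"));
  }
}
{code}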
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-2.patch Fix another place that assumed working directory wasn't fully qualified. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567-1.patch, HADOOP-2567-2.patch, > HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2404) HADOOP-2185 breaks compatibility with hadoop-0.15.0
[ https://issues.apache.org/jira/browse/HADOOP-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557441#action_12557441 ] Doug Cutting commented on HADOOP-2404: -- > "some processing" of exactly these parameters was introduced in HADOOP-1085. > I opposed it then. You just committed it. That looks like I made a mistake. Mea culpa. I don't recall the details, but in those days I was doing a lot of commits and my reviews may have sufferered. > But I do not agree that they should be introdueced in this patch, which will > lead to massive changes I disagree that the changes are massive. They're easy to locate (points where the modified parameters are accessed) not that many locations, and only affect a line or two of code at each location. I also disagree that the size alone of the change should be a significant factor here. The change is simple enough that it will not be destabilizing. The places changed are not likely to be touched by many other pending patches, so it should not create many conflicts. > This argument is going on for almost a month now. I do not find it productive. > I mean, people can have different opinions, what do you do with that. If committers cannot reach consensus, then the issue can be taken to the PMC, although that seems like overkill in this case. If you decline to fix it in a way that others approve, and it is a blocker, then someone else must develop a patch that we can all agree on before we can make the release. I think you are the best qualified person to fix this. I could try to generate a patch, but it would probably take me a lot longer than it would you and I would be more likely to make subtle errors, since I am less intimate with the changes. > HADOOP-2185 breaks compatibility with hadoop-0.15.0 > --- > > Key: HADOOP-2404 > URL: https://issues.apache.org/jira/browse/HADOOP-2404 > Project: Hadoop > Issue Type: Bug > Components: conf >Affects Versions: 0.16.0 >Reporter: Arun C Murthy >Assignee: Konstantin Shvachko >Priority: Blocker > Fix For: 0.16.0 > > Attachments: ConfigConvert.patch, ConfigConvert2.patch, > ConfigurationConverter.patch > > > HADOOP-2185 removed the following configuration parameters: > {noformat} > dfs.secondary.info.port > dfs.datanode.port > dfs.info.port > mapred.job.tracker.info.port > tasktracker.http.port > {noformat} > and changed the following configuration parameters: > {noformat} > dfs.secondary.info.bindAddress > dfs.datanode.bindAddress > dfs.info.bindAddress > mapred.job.tracker.info.bindAddress > mapred.task.tracker.report.bindAddress > tasktracker.http.bindAddress > {noformat} > without a backward-compatibility story. > Lots are applications/cluster-configurations are prone to fail hence, we need > a way to keep things working as-is for 0.16.0 and remove them for 0.17.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557431#action_12557431 ] Doug Cutting commented on HADOOP-2385: -- > Why setters need to be static? Users need to, e.g., be able to set HDFS parameters on a JobConf. We can get away with a single subclass of Configuration that has setters, but once we add a second, it would be impossible to create a single configuration instance that can configure multiple components. > Why per-package, not per-component? That's fine too. You seemed to be complaining that classes were too specific for this case, so I said I was okay with per-package if you thought that more appropirate here, although perhaps that's too general for your taste in this case, and you'd rather separate, e.g., Namenode from Datanode parameters. That's fine with me too. However I don't find the argument that FSNamesystem is already too big compelling. That's a separate issue: it should perhaps be decomposed into multiple classes, and when that's done, configuration accessors might move around, but if there are FSNamesystem-specific configuration accessors then I'd argue they belong in FSNamesystem, regardless of that class's current size. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
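A hedged illustration of the accessor style being discussed; the class name and configuration key below are hypothetical, chosen only to show the shape (static, so HDFS parameters can be set on any Configuration, including a JobConf):
{code}
import org.apache.hadoop.conf.Configuration;

// Hypothetical per-component parameter accessors; not existing Hadoop code.
public class NamenodeParams {
  public static final String HANDLER_COUNT_KEY = "dfs.namenode.handler.count"; // illustrative key

  public static void setHandlerCount(Configuration conf, int count) {
    conf.setInt(HANDLER_COUNT_KEY, count);   // works on a JobConf too, since it extends Configuration
  }

  public static int getHandlerCount(Configuration conf) {
    return conf.getInt(HANDLER_COUNT_KEY, 10);
  }
}
{code}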
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557428#action_12557428 ] Doug Cutting commented on HADOOP-2528: -- > In this particular jira, is it OK that we create the output directory by the > job client? +1 That would make this patch very simple, not much more than one line! However, we should not lose some of the changes to FileSystem.java, those deprecating all of the listPaths() signatures, and adding a listStatus(Path, Filter) signature. Should we add a separate issue for those, or fix them as a part of HADOOP-2566? > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567-1.patch Fix a test case that assumed getWorkingDir() was not fully qualified. Note that because of this change (working dirs are now fully qualified) this change should probably be included in the "incompatible" section. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567-1.patch, HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557412#action_12557412 ] Doug Cutting commented on HADOOP-2560: -- > It is not going to work to combine splits statically because block replicas > are not co-resident. If the number of blocks in the job input is hugely greater than the number of nodes, then it should be easy to find nodes that have a large number of blocks locally, and group the blocks thusly into tasks. If a task fails, then the re-execution might not be local, but most tasks don't fail, and we can arrange things so that the first node a task is assigned to has all its blocks. Or am i missing something? Consider the following algorithm: - build and maps for the job input files - N is the desired blocks/task - for (node : nodes) pop N blocks off each nodes list and add it to the list of tasks - as each block is popped, also remove it from all other node's lists, using the other map to accelerate this - repeat until nodes have fewer than N blocks, then emit tasks with fewer than N blocks as the tail of the job Wouldn't that work? > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
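A rough sketch of the grouping outlined above. The "maps" it builds amount to a node-to-blocks map plus a reverse block-to-nodes index (the "other map" used to accelerate removal); Block, the node keys, and the class itself are stand-ins, not Hadoop classes:
{code}
import java.util.*;

// Hypothetical greedy grouping: pop up to n local blocks per node into a task,
// removing each popped block from every other node's list via the reverse index.
class BlockGrouper {
  static <B> List<List<B>> group(Map<String, List<B>> blocksPerNode, int n) {
    Map<B, Set<String>> nodesPerBlock = new HashMap<B, Set<String>>();
    Map<String, LinkedHashSet<B>> remaining = new HashMap<String, LinkedHashSet<B>>();
    for (Map.Entry<String, List<B>> e : blocksPerNode.entrySet()) {
      remaining.put(e.getKey(), new LinkedHashSet<B>(e.getValue()));
      for (B b : e.getValue()) {
        Set<String> nodes = nodesPerBlock.get(b);
        if (nodes == null) {
          nodes = new HashSet<String>();
          nodesPerBlock.put(b, nodes);
        }
        nodes.add(e.getKey());
      }
    }
    List<List<B>> tasks = new ArrayList<List<B>>();
    boolean progress = true;
    while (progress) {                 // sweep until no node has blocks left
      progress = false;
      for (String node : remaining.keySet()) {
        LinkedHashSet<B> local = remaining.get(node);
        if (local.isEmpty()) {
          continue;
        }
        List<B> task = new ArrayList<B>();
        Iterator<B> it = local.iterator();
        while (it.hasNext() && task.size() < n) {
          B block = it.next();
          it.remove();
          task.add(block);
          for (String other : nodesPerBlock.get(block)) {
            if (!other.equals(node)) {
              remaining.get(other).remove(block);   // no longer available elsewhere
            }
          }
        }
        tasks.add(task);               // tail tasks may hold fewer than n blocks
        progress = true;
      }
    }
    return tasks;
  }
}
{code}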
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Status: Patch Available (was: Open) > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2567: Assignee: Doug Cutting > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2567) add FileSystem#getHomeDirectory() method
[ https://issues.apache.org/jira/browse/HADOOP-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2567: - Attachment: HADOOP-2567.patch Patch that implements this. Also makes both home and working dirs fully qualified. > add FileSystem#getHomeDirectory() method > > > Key: HADOOP-2567 > URL: https://issues.apache.org/jira/browse/HADOOP-2567 > Project: Hadoop > Issue Type: New Feature > Components: fs >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2567.patch > > > The FileSystem API would benefit from a getHomeDirectory() method. > The default implementation would return "/user/$USER/". > RawLocalFileSystem would return System.getProperty("user.home"). > HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2567) add FileSystem#getHomeDirectory() method
add FileSystem#getHomeDirectory() method Key: HADOOP-2567 URL: https://issues.apache.org/jira/browse/HADOOP-2567 Project: Hadoop Issue Type: New Feature Components: fs Reporter: Doug Cutting Fix For: 0.16.0 The FileSystem API would benefit from a getHomeDirectory() method. The default implementation would return "/user/$USER/". RawLocalFileSystem would return System.getProperty("user.home"). HADOOP-2514 can use this to implement per-user trash. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
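A minimal sketch of the two behaviours described in the issue above, written outside any FileSystem subclass. It assumes "$USER" maps to the user.name system property and omits the fully-qualifying step discussed in the patch comments; the real patch may differ.
{code}
import org.apache.hadoop.fs.Path;

public class HomeDirSketch {
  // What the generic FileSystem default could return: "/user/$USER/".
  static Path defaultHome() {
    return new Path("/user/" + System.getProperty("user.name"));
  }

  // What RawLocalFileSystem could return instead: the local home directory.
  static Path localHome() {
    return new Path(System.getProperty("user.home"));
  }

  public static void main(String[] args) {
    System.out.println(defaultHome() + " vs " + localHome());
  }
}
{code}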
[jira] Commented: (HADOOP-2268) JobControl classes should use interfaces rather than implemenations
[ https://issues.apache.org/jira/browse/HADOOP-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557403#action_12557403 ] Doug Cutting commented on HADOOP-2268: -- +1 This patch looks fine to me. > JobControl classes should use interfaces rather than implemenations > --- > > Key: HADOOP-2268 > URL: https://issues.apache.org/jira/browse/HADOOP-2268 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Affects Versions: 0.15.0 >Reporter: Adrian Woodhead >Assignee: Adrian Woodhead >Priority: Minor > Fix For: 0.16.0 > > Attachments: HADOOP-2268-1.patch, HADOOP-2268-2.patch, > HADOOP-2268-3.patch, HADOOP-2268-4.patch > > > See HADOOP-2202 for background on this issue. Arun C. Murthy agrees that when > possible it is preferable to program against the interface rather than a > concrete implementation (more flexible, allows for changes of the > implementation in future etc.) JobControl currently exposes running, waiting, > ready, successful and dependent jobs as ArrayList rather than List. I propose > to change this to List. > I will code up a patch for this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
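The change itself is small; a generic sketch of the style being approved here, with hypothetical names:
{code}
import java.util.ArrayList;
import java.util.List;

// The backing field can stay an ArrayList, but the accessor exposes the List interface,
// so the concrete collection can change later without breaking callers.
class JobQueueSketch<J> {
  private final List<J> running = new ArrayList<J>();

  public List<J> getRunningJobs() {   // previously: public ArrayList<J> getRunningJobs()
    return running;
  }

  public void addJob(J job) {
    running.add(job);
  }
}
{code}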
[jira] Assigned: (HADOOP-2566) need FileSystem#globStatus method
[ https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2566: Assignee: Hairong Kuang > need FileSystem#globStatus method > - > > Key: HADOOP-2566 > URL: https://issues.apache.org/jira/browse/HADOOP-2566 > Project: Hadoop > Issue Type: Improvement > Components: fs >Reporter: Doug Cutting >Assignee: Hairong Kuang > Fix For: 0.16.0 > > > To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting > performance, we must use file enumeration APIs that return FileStatus[] > rather than Path[]. Currently we have FileSystem#globPaths(), but that > method should be deprecated and replaced with a FileSystem#globStatus(). > We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the > cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting reassigned HADOOP-2514: Assignee: Doug Cutting > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler >Assignee: Doug Cutting > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2564) NameNode to blat total number of files and blocks
[ https://issues.apache.org/jira/browse/HADOOP-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557379#action_12557379 ] Doug Cutting commented on HADOOP-2564: -- This was included in HADOOP-2447, just committed, no? If that's satisfactory, we can close this as "duplicate". > NameNode to blat total number of files and blocks > - > > Key: HADOOP-2564 > URL: https://issues.apache.org/jira/browse/HADOOP-2564 > Project: Hadoop > Issue Type: Improvement >Reporter: Marco Nicosia >Priority: Minor > Fix For: 0.17.0 > > > Right now, the namenode reports lots of rates (block read per sec, removed > per sec, etc etc) but it doesn't actually report how many files and blocks > total exist in the system. It'd be great if we could have this, so that our > reporting systems can show the growth trends over time. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2566) need FileSystem#globStatus method
need FileSystem#globStatus method - Key: HADOOP-2566 URL: https://issues.apache.org/jira/browse/HADOOP-2566 Project: Hadoop Issue Type: Improvement Components: fs Reporter: Doug Cutting Fix For: 0.16.0 To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting performance, we must use file enumeration APIs that return FileStatus[] rather than Path[]. Currently we have FileSystem#globPaths(), but that method should be deprecated and replaced with a FileSystem#globStatus(). We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the cache in 0.17. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
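A sketch of how the proposed method might be used; the signature below is a guess at what HADOOP-2566 could add, not an API that existed at the time.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobStatusSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Same glob semantics as globPaths(), but each match carries its FileStatus, so callers
    // get length, modification time, permissions, etc. without an extra RPC per matched path.
    FileStatus[] matches = fs.globStatus(new Path(args[0]));   // hypothetical method
    for (FileStatus s : matches) {
      System.out.println(s.getPath() + "\t" + s.getLen());
    }
  }
}
{code}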
[jira] Created: (HADOOP-2565) DFSPath cache of FileStatus can become stale
DFSPath cache of FileStatus can become stale Key: HADOOP-2565 URL: https://issues.apache.org/jira/browse/HADOOP-2565 Project: Hadoop Issue Type: Bug Affects Versions: 0.16.0 Reporter: Doug Cutting Fix For: 0.17.0 Paths returned from DFS internally cache their FileStatus, so that getStatus(Path) does not require another RPC. This cache is never refreshed and become stale, resulting in program error. This should not be fixed until FileSystem#listStatus() is removed by HADOOP-2563, and user code is thus no longer dependent on this cache for good performance. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2563) Remove deprecated FileSystem#listPaths()
Remove deprecated FileSystem#listPaths() Key: HADOOP-2563 URL: https://issues.apache.org/jira/browse/HADOOP-2563 Project: Hadoop Issue Type: Improvement Components: fs Reporter: Doug Cutting Fix For: 0.17.0 FileSystem#listPaths() has been deprecated for a few releases, and we should now remove it, upgrading everything to use FileSystem#listStatus(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2404) HADOOP-2185 breaks compatibility with hadoop-0.15.0
[ https://issues.apache.org/jira/browse/HADOOP-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557369#action_12557369 ] Doug Cutting commented on HADOOP-2404: -- > I thought and still think it is more fair not to provide any backward > compatibility at all [ ... ] Huh? That's a change from what you stated in [#action_12550831]. No one is asking for 100% back-compatibility here, but rather for a reasonable interpretation where possible of configuration parameters that have changed. At the very least, if we can easily detect that someone is using a feature that has been incompatibly changed, we should attempt to emit a warning, and not just let things mysteriously fail, no? > I understand your irritation on the configuration issues, but I don't > understand why blame my or equally any other patch for not dealing with them. You imply that I am asking this issue to fix a few instances of a widespread problem unrelated to the issue. That is not the case. The issue is both specific and related. If a config parameter is only read in a single place, then no accessor method is needed. If it is simply read in multiple places, then an accessor method is nice, since it helps prevent misspellings and makes things easier if the parameter ever requires more processing, but not mandatory. Once some processing is needed for every access to a parameter then an accessor method is required, since otherwise we'd replicate non-trivial program logic. HADOOP-2185 pushed several parameters past this threshold, since back-compatibility processing is now required when these parameters are accessed, and thus accessor methods must be added. > HADOOP-2185 breaks compatibility with hadoop-0.15.0 > --- > > Key: HADOOP-2404 > URL: https://issues.apache.org/jira/browse/HADOOP-2404 > Project: Hadoop > Issue Type: Bug > Components: conf >Affects Versions: 0.16.0 >Reporter: Arun C Murthy >Assignee: Konstantin Shvachko >Priority: Blocker > Fix For: 0.16.0 > > Attachments: ConfigConvert.patch, ConfigConvert2.patch, > ConfigurationConverter.patch > > > HADOOP-2185 removed the following configuration parameters: > {noformat} > dfs.secondary.info.port > dfs.datanode.port > dfs.info.port > mapred.job.tracker.info.port > tasktracker.http.port > {noformat} > and changed the following configuration parameters: > {noformat} > dfs.secondary.info.bindAddress > dfs.datanode.bindAddress > dfs.info.bindAddress > mapred.job.tracker.info.bindAddress > mapred.task.tracker.report.bindAddress > tasktracker.http.bindAddress > {noformat} > without a backward-compatibility story. > Lots are applications/cluster-configurations are prone to fail hence, we need > a way to keep things working as-is for 0.16.0 and remove them for 0.17.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
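An illustration of the kind of accessor being argued for: once a parameter needs back-compatibility handling on every read, that handling has to live in one method rather than be repeated at each call site. Key names and defaults below are placeholders, not taken from HADOOP-2185 or any attached patch.
{code}
import org.apache.hadoop.conf.Configuration;

public class InfoServerConfig {
  /** Read the new-style combined address, falling back to the old host/port pair with a warning. */
  public static String getInfoServerAddress(Configuration conf) {
    String addr = conf.get("dfs.info.address");                 // hypothetical new-style key
    if (addr != null) {
      return addr;
    }
    String host = conf.get("dfs.info.bindAddress", "0.0.0.0");  // old 0.15-style keys
    String port = conf.get("dfs.info.port", "50070");
    System.err.println("WARNING: dfs.info.bindAddress/dfs.info.port are deprecated;"
        + " use dfs.info.address instead");
    return host + ":" + port;
  }
}
{code}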
[jira] Commented: (HADOOP-2385) Validate configuration parameters
[ https://issues.apache.org/jira/browse/HADOOP-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557359#action_12557359 ] Doug Cutting commented on HADOOP-2385: -- > The Configuration itself should remain the same for each component. > It just exposes get methods specific to the component. Yes, that would work for getters, but not for setters. In many cases we need setters too, and it would be confusing to implement getters and setters using different styles. Setters are best implemented as static methods, thus, for symmetry, getters must be also. > I do not support the idea of placing static getters for configuration > parameters in the (top-level) component I'm okay having per-package config classes (e.g., DFSConfig) that centralize configuration setters and getters for that package, since, in some cases, the classes which consume these (e.g., FSNamesystem) are not public classes. > Validate configuration parameters > - > > Key: HADOOP-2385 > URL: https://issues.apache.org/jira/browse/HADOOP-2385 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > > Configuration parameters should be fully validated before name nodes or data > nodes begin service. > Required parameters must be present. > Required and optional parameters must have values of proper type and range. > Undefined parameters must not be present. > (I was recently observing some confusion whose root cause was a mis-spelled > parameter.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
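A sketch of the per-package style accepted above, using DFSConfig as the example name from the comment; the key shown is a real one, but the class and its methods are illustrative only.
{code}
import org.apache.hadoop.conf.Configuration;

public class DFSConfig {
  private static final String REPLICATION_KEY = "dfs.replication";

  // Static getter and setter live together, so the key spelling, the default,
  // and any validation or back-compat processing are written exactly once.
  public static int getReplication(Configuration conf) {
    return conf.getInt(REPLICATION_KEY, 3);
  }

  public static void setReplication(Configuration conf, int replication) {
    if (replication < 1) {
      throw new IllegalArgumentException("replication must be >= 1: " + replication);
    }
    conf.setInt(REPLICATION_KEY, replication);
  }
}
{code}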
[jira] Commented: (HADOOP-2560) Combining multiple input blocks into one mapper
[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557340#action_12557340 ] Doug Cutting commented on HADOOP-2560: -- > combine multiple input blocks with the same rack into one split [ ... ] That makes good sense to me. The new Split class could look a lot like MultiFileSplit, but would additionally support a 'getStart(int)' method. So perhaps MultiFileSplit could be extended for this purpose. FileInputFormat could be modified to emit these when the number of splits would otherwise exceed some threshold. But then all subclasses of FileInputFormat would need to be modified to be able to accept these. That wouldn't be too hard. FileInputFormat could implement getRecordReader(InputSplit) to break out the sub-splits, then call a new method, getRecordReader(FileSplit). All existing subclasses could then just change the signature of their getRecordReader implementations in order to support the new feature. > Combining multiple input blocks into one mapper > --- > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop > Issue Type: Bug >Reporter: Runping Qi > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
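A sketch of what such a split could carry, per the comment above: like MultiFileSplit it names several files, but it also records a start offset per file. The names are illustrative, and the Writable and getLocations() plumbing a real InputSplit needs is omitted.
{code}
import org.apache.hadoop.fs.Path;

public class CombinedFileSplitSketch {
  private final Path[] paths;
  private final long[] starts;    // the addition over MultiFileSplit: per-file start offsets
  private final long[] lengths;

  public CombinedFileSplitSketch(Path[] paths, long[] starts, long[] lengths) {
    this.paths = paths;
    this.starts = starts;
    this.lengths = lengths;
  }

  public int getNumPaths()      { return paths.length; }
  public Path getPath(int i)    { return paths[i]; }
  public long getStart(int i)   { return starts[i]; }   // the method the comment asks for
  public long getLength(int i)  { return lengths[i]; }
}
{code}
FileInputFormat#getRecordReader(InputSplit) could then unpack one of these into per-file FileSplits and hand each to the subclass's getRecordReader(FileSplit), as described in the comment.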
[jira] Commented: (HADOOP-2510) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/HADOOP-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557066#action_12557066 ] Doug Cutting commented on HADOOP-2510: -- The stated goals of this design are to improve things when running mapreduce on a subset of the nodes of a cluster, when HDFS is run on all nodes. The current approach is to run new mapreduce daemons (jobtracker and tasktrackers) for the subset. The problems are that this does not utilize nodes as fully as they could be (e.g., during the tail of a job) and it inhibits data locality optimizations. The proposed solution is to split the jobtracker daemon in two, one shared, long-running daemon, and a per job daemon. My concern with this approach is that adding a new kind of daemon considerably complicates things. New classes of daemons exponentially increase the number of failure modes that must be tested and debugged. This could be warranted if it permitted greater sharing of functionality between systems, reducing the amount of functionality that we must maintain. For example, we could add a general node allocation system, and built map-reduce on top of this. But for that to be a convincingly independent layer, we'd need to demonstrate that we can build other, non-mapreduce systems on it, e.g., perhaps hdfs, but this proposal doesn't seem to offer that. I propose that the stated problems can be more simply and directly solved without adding a new daemon, but with the existing integrated system. We can add a job parameter naming the maximum number of nodes that will be used simultaneously. Then a single jobtracker for the entire cluster can schedule tasks for multiple jobs at a time, each running on different subsets of nodes. A cluster of 1000 nodes might be configured to limit jobs to 200 nodes each. As jobs are winding down and no longer use all 200 nodes, the next job can use those nodes, improving utilization, the first stated goal of this issue. The entire cluster is available to the jobtracker for scheduling, so that it can arrange to place tasks on nodes where their data is local, addressing the second stated goal of this issue. Splitting the jobtracker sounds like it would simplify things, since it would result in two simpler services, but distributed systems are more impacted by the number of kinds of services than by the complexity of a single service. Thus perhaps the jobtracker could be better structured internally, to separate concerns within its implementation, but I do not yet see an argument for moving them to separate services. That seems like it will only make things less reliable: the same logic running in two daemons that could run equivalently in a single daemon. > Map-Reduce 2.0 > -- > > Key: HADOOP-2510 > URL: https://issues.apache.org/jira/browse/HADOOP-2510 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Arun C Murthy > > We, at Yahoo!, have been using Hadoop-On-Demand as the resource > provisioning/scheduling mechanism. > With HoD the user uses a self-service system to ask-for a set of nodes. HoD > allocates these from a global pool and also provisions a private Map-Reduce > cluster for the user. She then runs her jobs and shuts the cluster down via > HoD when done. All user-private clusters use the same humongous, static HDFS > (e.g. 2k node HDFS). > More details about HoD are available here: HADOOP-1301. > > h3. 
Motivation > The current deployment (Hadoop + HoD) has a couple of implications: > * _Non-optimal Cluster Utilization_ >1. Job-private Map-Reduce clusters imply that the user-cluster potentially > could be *idle* for atleast a while before being detected and shut-down. >2. Elastic Jobs: Map-Reduce jobs, typically, have lots of maps with > much-smaller no. of reduces; with maps being light and quick and reduces > being i/o heavy and longer-running. Users typically allocate clusters > depending on the no. of maps (i.e. input size) which leads to the scenario > where all the maps are done (idle nodes in the cluster) and the few reduces > are chugging along. Right now, we do not have the ability to shrink the > HoD'ed Map-Reduce clusters which would alleviate this issue. > * _Impact on data-locality_ > With the current setup of a static, large HDFS and much smaller (5/10/20/50 > node) clusters there is a good chance of losing one of Map-Reduce's primary > features: ability to execute tasks on the datanodes where the input splits > are located. In fact, we have seen the data-local tasks go down to 20-25 > percent in the GridMix benchmarks, from the 95-98 percent we see on the > randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a > synthetic benchmark, but yet). Admittedly, HADOOP-1985 (rack-aware > Map-Reduc
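Purely as an illustration of the alternative proposed in the comment above (a per-job node cap honored by a single, cluster-wide jobtracker) rather than anything that existed at the time; the parameter name below is made up.
{code}
import org.apache.hadoop.mapred.JobConf;

public class NodeCapExample {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Hypothetical knob: never schedule this job's tasks on more than 200 nodes at once.
    job.setInt("mapred.job.max.nodes", 200);
    // Scheduler-side pseudo-rule: before assigning a task of this job to a node it is not
    // already running on, check that the job is running on fewer than 200 distinct nodes.
  }
}
{code}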
[jira] Commented: (HADOOP-1824) want InputFormat for zip files
[ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557061#action_12557061 ] Doug Cutting commented on HADOOP-1824: -- > 2. Override the getSplits() method to read each file's InputStream I think getSplits() should construct a split for each element of java.util.zip.ZipFile#entries(). > 3. Create FileSplits [ ... ] We should probably extend FileSplit or InputSplit specifically for zip files. The fields needed per split are the archive file's path and the path of the file within the archive. I don't think there's much point in supporting splits smaller than a file within the zip archive, so start and end offsets are not required here. > 4. Implement class ZipRecordReader to read each zip entry in its split Using LineRecordReader. We should be able to use LineRecordReader directly, passing its constructor the result of ZipFile#getInputStream(). > want InputFormat for zip files > -- > > Key: HADOOP-1824 > URL: https://issues.apache.org/jira/browse/HADOOP-1824 > Project: Hadoop > Issue Type: New Feature > Components: mapred >Reporter: Doug Cutting > > HDFS is inefficient with large numbers of small files. Thus one might pack > many small files into large, compressed, archives. But, for efficient > map-reduce operation, it is desireable to be able to split inputs into > smaller chunks, with one or more small original file per split. The zip > format, unlike tar, permits enumeration of files in the archive without > scanning the entire archive. Thus a zip InputFormat could efficiently permit > splitting large archives into splits that contain one or more archived files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
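The two halves described above can be sketched with plain java.util.zip, independent of the eventual Hadoop classes; the split and reader types themselves are only described in the comments.
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipSplitSketch {
  public static void main(String[] args) throws Exception {
    ZipFile zip = new ZipFile(args[0]);

    // "getSplits": one split per archived file; the archive path plus the entry name is all a
    // split needs, since there is little point splitting inside an individual entry.
    for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
      ZipEntry entry = e.nextElement();
      if (!entry.isDirectory()) {
        System.out.println("split: " + args[0] + "!" + entry.getName());
      }
    }

    // "getRecordReader": ZipFile#getInputStream(entry) is an ordinary InputStream, so a
    // LineRecordReader (or, here, a BufferedReader) can consume it directly.
    ZipEntry first = zip.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zip.getInputStream(first)));
    for (String line; (line = in.readLine()) != null;) {
      System.out.println(line);
    }
    in.close();
    zip.close();
  }
}
{code}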
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Attachment: HADOOP-2552.patch > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2552) enable hdfs permission checking by default
[ https://issues.apache.org/jira/browse/HADOOP-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2552: - Assignee: Doug Cutting Status: Patch Available (was: Open) > enable hdfs permission checking by default > -- > > Key: HADOOP-2552 > URL: https://issues.apache.org/jira/browse/HADOOP-2552 > Project: Hadoop > Issue Type: Improvement > Components: dfs >Reporter: Doug Cutting >Assignee: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2552.patch > > > We should enable permission checking in dfs by default. Currently, on > upgrade, all file permissions are 777, so this is a back-compatible change. > After an upgrade folks can change owners and groups and limit permissions, > and things will work as expected. > The current default, dfs.permissions=false, gives inconsistent behaviour: > permissions are displayed in 'ls' and returned by the FileSystem APIs, but > they're not enforced. In future releases we will certainly want > dfs.permissions=true to be the default, and making it so now will thus also > avoid an incompatible change. > dfs.permissions=false should be an optional, non-default configuration that > some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2552) enable hdfs permission checking by default
enable hdfs permission checking by default -- Key: HADOOP-2552 URL: https://issues.apache.org/jira/browse/HADOOP-2552 Project: Hadoop Issue Type: Improvement Components: dfs Reporter: Doug Cutting Fix For: 0.16.0 We should enable permission checking in dfs by default. Currently, on upgrade, all file permissions are 777, so this is a back-compatible change. After an upgrade folks can change owners and groups and limit permissions, and things will work as expected. The current default, dfs.permissions=false, gives inconsistent behaviour: permissions are displayed in 'ls' and returned by the FileSystem APIs, but they're not enforced. In future releases we will certainly want dfs.permissions=true to be the default, and making it so now will thus also avoid an incompatible change. dfs.permissions=false should be an optional, non-default configuration that some sites may decide to use. It is further defined in HADOOP-2543. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2532) Add to MapFile a getClosest that returns key that comes just-before if key not present (Currently does just-after only).
[ https://issues.apache.org/jira/browse/HADOOP-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557008#action_12557008 ] Doug Cutting commented on HADOOP-2532: -- +1 This looks fine to me. > Add to MapFile a getClosest that returns key that comes just-before if key > not present (Currently does just-after only). > > > Key: HADOOP-2532 > URL: https://issues.apache.org/jira/browse/HADOOP-2532 > Project: Hadoop > Issue Type: New Feature >Reporter: stack >Assignee: stack >Priority: Minor > Fix For: 0.16.0 > > Attachments: getclosestbefore-v2.patch, getclosestbefore-v3.patch, > getclosestbefore.patch > > > The list of regions that make up a table in hbase are effectively kept in a > mapfile. Regions are identified by the first row contained by that region. > To find the region that contains a particular row, we need to be able to > search the mapfile of regions to find the closest matching row that falls > just-before the searched-for key rather than the just-after that is current > mapfile getClosest behavior. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
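A sketch of how the new behaviour would be used, assuming the patch adds a boolean "before" argument to MapFile.Reader#getClosest (the pre-patch method only searches forward); the key type and file layout here are just for illustration.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class ClosestBeforeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader regions = new MapFile.Reader(fs, args[0], conf);

    Text row = new Text("row-0042");
    Text regionInfo = new Text();
    // Find the region whose start row is the largest key <= the searched-for row.
    Text startRow = (Text) regions.getClosest(row, regionInfo, true /* before */);
    System.out.println("row " + row + " falls in the region starting at " + startRow);
    regions.close();
  }
}
{code}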
[jira] Commented: (HADOOP-2206) Design/implement a general log-aggregation framework for Hadoop
[ https://issues.apache.org/jira/browse/HADOOP-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556991#action_12556991 ] Doug Cutting commented on HADOOP-2206: -- > I got Arun a copy of Scribe a few months ago. Any chance you can post a public copy somewhere? > Design/implement a general log-aggregation framework for Hadoop > --- > > Key: HADOOP-2206 > URL: https://issues.apache.org/jira/browse/HADOOP-2206 > Project: Hadoop > Issue Type: New Feature > Components: dfs, mapred >Reporter: Arun C Murthy >Assignee: Arun C Murthy > Fix For: 0.17.0 > > > I'd like to propose a log-aggregation framework which facilitates collection, > aggregation and storage of the logs of the Hadoop Map-Reduce framework and > user-jobs in HDFS. Clearly the design/implementation of this framework is > heavily influenced and limited by Hadoop itself for e.g. lack of appends, not > too many small files (think: stdout/stderr/syslog of each map/reduce task) > and so on. > This framework will be especially useful once HoD (HADOOP-1301) is used to > provision dynamic, per-user, Map-Reduce clusters. > h4. Requirements: > * Store the various logs to a configurable location in the Hadoop > Distributed FileSystem > ** User task logs (stdout, stderr, syslog) > ** Map-Reduce daemons' logs (JobTracker and TaskTracker) > * Integrate well with Hadoop and ensure no adverse performance impact on the > Map-Reduce framework. > * It must not use a HDFS file (or more!) per a task, which would swamp the > NameNode capabilities. > * The aggregation system must be distributed and reliable. > * Facilities/tools to read the aggregated logs. > * The aggregated logs should be compressed. > h4. Architecture: > Here is a high-level overview of the log-aggregation framework: > h5. Logging > * Provision a cloud of log-aggregators in the cluster (outside of the Hadoop > cluster, running on the subset of nodes in the cluster). Lets call each one > in the cloud as a Log Aggregator i.e. LA. > * Each LA writes out 2 files per Map-Reduce cluster: an index file and a data > file. The LA maintains one directory per Map-Reduce cluster on HDFS. > * The index file format is simple: > ** streamid (_streamid_ is either daemon identifier e.g. > tasktracker_foo.bar.com:57891 or $jobid-$taskid-(stdout|stderr|syslog) or > individual task-logs) > ** timestamp > ** logs-data start offset > ** no. of bytes > * Each Hadoop daemon (JT/TT) is given the entire list of LAs in the cluster. > * Each daemon picks one LA (at random) from the list, opens an exclusive > stream with the LA after identifying itself (i.e. ${daemonid}) and sends it's > logs. In case of error/failure to log it just connects to another LA as above > and starts logging to it. > * The logs are sent to the LA by a new log4j appender. The appender provides > some amount of buffering on the client-side. > * Implement a feature in the TaskTracker which lets it use the same appender > to send out the userlogs (stdout/stderr/syslog) to the LA after task > completion. This is important to ensure that logging to the LA at runtime > doesn't hurt the task's performance (see HADOOP-1553). The TaskTracker picks > an LA per task in a manner similar to the one it uses for it's own logs, > identifies itself (<${jobid}, ${taskid}, {stdout|stderr|syslog}>) and streams > the entire task-log at one go. In fact we can pick different LAs for each of > the task's stdout, stderr and syslog logs - each an exclusive stream to a > single LA. 
> * The LA buffers some amount of data in memory (say 16K) and then flushes > that data to the HDFS file (per LA per cluster) after writing out an entry to > the index file. > * The LA periodically purges old logs (monthly, fortnightly or weekly as > today). > h5. Getting the logged information > The main requirement is to implement a simple set of tools to query the LA > (i.e. the index/data files on HDFS) to glean the logged information. > If we can think of each Map-Reduce cluster's logs as a set of archives (i.e. > one file per cluster per LA used) we need the ability to query the > log-archive to figure out the available streams and the ability to get one > entire stream or a subset of time based on timestamp-ranges. Essentially > these are simple tools which parse the index files of each LA (for a given > Hadoop cluster) and return the required information. > h6. Query for available streams > The query just returns all the available streams in an cluster-log archive > identified by the HDFS path. > It looks something like this for a cluster with 3 nodes which ran 2 jobs, > first of which had 2 maps, 1 reduce and the second had 1 map, 1 reduce: > {noformat} >$ la -query /log-aggregation/cluster-20071113 >Ava
[jira] Commented: (HADOOP-1873) User permissions for Map/Reduce
[ https://issues.apache.org/jira/browse/HADOOP-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556989#action_12556989 ] Doug Cutting commented on HADOOP-1873: -- +1 this looks good to me. Thanks for your patience in working this out! > User permissions for Map/Reduce > --- > > Key: HADOOP-1873 > URL: https://issues.apache.org/jira/browse/HADOOP-1873 > Project: Hadoop > Issue Type: Improvement >Reporter: Raghu Angadi >Assignee: Hairong Kuang > Attachments: mapred.patch, mapred2.patch, mapred3.patch, > mapred4.patch, mapred5.patch, mapred6.patch, mapred7.patch > > > HADOOP-1298 and HADOOP-1701 add permissions and pluggable security for DFS > files and DFS accesses. Same users permission should work for Map/Reduce jobs > as well. > User persmission should propegate from client to map/reduce tasks and all the > file operations should be subject to user permissions. This is transparent to > the user (i.e. no changes to user code should be required). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2551) hadoop-env.sh needs finer granularity
[ https://issues.apache.org/jira/browse/HADOOP-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556986#action_12556986 ] Doug Cutting commented on HADOOP-2551: -- I don't think we need HADOOP_GLOBAL_OPTS, we can just use HADOOP_OPTS for that, but we could add a HADOOP_NAMENODE_OPTS that, when starting the namenode, is appended to HADOOP_OPTS, etc. In general, we could modify bin/hadoop to add the value of HADOOP_{$COMMAND}_OPTS to HADOOP_OPTS. Would that suffice? > hadoop-env.sh needs finer granularity > - > > Key: HADOOP-2551 > URL: https://issues.apache.org/jira/browse/HADOOP-2551 > Project: Hadoop > Issue Type: Improvement >Reporter: Allen Wittenauer >Priority: Minor > > We often configure our HADOOP_OPTS on the name node to have JMX running so > that we can do JVM monitoring. But doing so means that we need to edit this > file if we want to run other hadoop commands, such as fsck. It would be > useful if hadoop-env.sh was refactored a bit so that there were different > and/or cascading HADOOP_OPTS dependent upon which process/task was being > performed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
[ https://issues.apache.org/jira/browse/HADOOP-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556983#action_12556983 ] Doug Cutting commented on HADOOP-2531: -- Okay, I found it: the default permissions on upgrade are 777, with both user and group set to HadoopAnonymous. So I'm now leaning towards switching to dfs.permissions=true by default. > HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false > > > Key: HADOOP-2531 > URL: https://issues.apache.org/jira/browse/HADOOP-2531 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > > Generic permission checking code should still work correctly when > dfs.permissions=false. Currently FileStatus#getPermission() returns the > actual permission when dfs.permissions=false on the namenode, which is > incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
[ https://issues.apache.org/jira/browse/HADOOP-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556981#action_12556981 ] Doug Cutting commented on HADOOP-2531: -- The use case of dfs.permissions=false was better explained to me yesterday. It is intended to permit admins to set permissions after upgrade while leaving the filesystem available for use. If this use case is really important, then we should mark this "won't fix". Nigel expressed concerns about displaying permissions in "ls" that are not enforced would be confusing to users, that returning 777 would be better for that reason too. But if dfs.permissions is only meant to be used during transition, this may not be a serious issue. I'm beginning to think that dfs.permissions should be 'true' by default, and that the default permission on upgrade should be 777. That is back-compatible. Then, if folks like, they can set more prohibitive permissions and/or disable permission checking. If this is the default behavior then I am okay marking this issue "won't fix". Currently dfs.permissions is 'false' by default, so perhaps that should change. I am not yet certain what the default file permission is after an upgrade... > HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false > > > Key: HADOOP-2531 > URL: https://issues.apache.org/jira/browse/HADOOP-2531 > Project: Hadoop > Issue Type: Bug > Components: dfs >Reporter: Doug Cutting > Fix For: 0.16.0 > > > Generic permission checking code should still work correctly when > dfs.permissions=false. Currently FileStatus#getPermission() returns the > actual permission when dfs.permissions=false on the namenode, which is > incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556976#action_12556976 ] Doug Cutting commented on HADOOP-2528: -- Raghu and Hairong yesterday raised a few relevant issues: - the superuser name is not currently known on the client, and until it is, we can get false negatives in permission checks - the goal of dfs.permissions=false is for admins to be able to set, examine and alter permissions before they are enforced, so that a filesystem may be upgraded and returned to service before permissions are completely configured. Returning 777 for all files when dfs.permissions=false would prohibit this use. This patch, as it stands, fights a bit with that use case too. If permission checking is disabled on the namenode, then there's a good chance that permissions are not yet correctly configured there, so checking them clientside may give the wrong results. Thus the goal of permitting folks to run jobs while permissions are being configured may be defeated by this patch. This patch was meant to be provocative: we're providing new APIs, but we have little real code that uses these new APIs. Mapreduce input/output validation seems like an obvious place to add permission checks, and hence an opportunity to check the usability of the APIs. I'm currently on the fence as to whether this patch should be committed in 0.16. Once dfs.permisisons=true, it would be really nice to fail a job quickly if its output directory is not writable, without first running all of the maps. Readability of input is less critical, since that will fail fairly quickly anyway. Perhaps we should add a utility method that checks the writability of a directory by creating and removing an empty file. This would be more reliably correct. I'll create a new patch with this approach. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
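A sketch of the probe utility suggested in the last paragraph; the method name is made up. Rather than re-implementing the permission model on the client, it asks the filesystem to create and delete an empty scratch file and lets the namenode be the judge.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputDirCheck {
  public static void checkWritable(FileSystem fs, Path dir) throws IOException {
    Path probe = new Path(dir, ".probe-" + System.currentTimeMillis());
    try {
      fs.create(probe).close();   // fails here if the submitter may not write to dir
    } catch (IOException e) {
      throw new IOException("Output directory " + dir + " is not writable: " + e.getMessage());
    }
    fs.delete(probe, false);      // clean up the empty probe file
  }
}
{code}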
[jira] Commented: (HADOOP-1873) User permissions for Map/Reduce
[ https://issues.apache.org/jira/browse/HADOOP-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556969#action_12556969 ] Doug Cutting commented on HADOOP-1873: -- Another option: In the FileSystem.create and FileSystem.mkdirs static utility methods, we might create the file or directory first, then set the protection. This has the disadvantage of making two RPC calls, but it has the advantage of being thread safe. In the current case (job submission) the performance impact of these extra RPCs would be negligible, no? > User permissions for Map/Reduce > --- > > Key: HADOOP-1873 > URL: https://issues.apache.org/jira/browse/HADOOP-1873 > Project: Hadoop > Issue Type: Improvement >Reporter: Raghu Angadi >Assignee: Hairong Kuang > Attachments: mapred.patch, mapred2.patch, mapred3.patch, > mapred4.patch, mapred5.patch, mapred6.patch > > > HADOOP-1298 and HADOOP-1701 add permissions and pluggable security for DFS > files and DFS accesses. Same users permission should work for Map/Reduce jobs > as well. > User persmission should propegate from client to map/reduce tasks and all the > file operations should be subject to user permissions. This is transparent to > the user (i.e. no changes to user code should be required). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
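A sketch of the create-then-set-permission option described above, for the mkdirs case; the static helper name is illustrative. It costs two RPCs, but no shared client-side state is involved, which is presumably what makes it thread safe.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class MkdirsWithPermission {
  public static boolean mkdirs(FileSystem fs, Path dir, FsPermission perm) throws IOException {
    boolean created = fs.mkdirs(dir);   // first RPC: create with the filesystem's default mode
    fs.setPermission(dir, perm);        // second RPC: then set the protection explicitly
    return created;
  }
}
{code}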
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556947#action_12556947 ] Doug Cutting commented on HADOOP-2514: -- I'm +1 for Sanjay's option 2 for 0.16. Note I don't believe this issue should be a blocker, since the existing trash code will work with a globally writable /trash. So we need to implement option 2 before the freeze. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556146#action_12556146 ] Doug Cutting commented on HADOOP-2528: -- > Should we enable permissions by default in DFS, at least through development > phase [ ...] I think we should certainly encourage developers to do this, but I'm hesitant to change it in subversion. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated HADOOP-2528: - Attachment: HADOOP-2528-1.patch Here's an updated version of the patch. I previously assumed that one could, e.g., read a file you own with rwx permissions, but in fact you can't. If you're the owner, then only the owner permissions are examined. I've updated the generic checker here to reflect that. I learn something new every day! > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch, HADOOP-2528-1.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
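The rule described above (only the matching class of bits is consulted, starting with the owner) can be sketched as a client-side check over a FileStatus; this is an approximation for illustration, not HDFS's own enforcement path.
{code}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class AccessCheck {
  public static boolean canAccess(FileStatus stat, String user, String[] groups, FsAction want) {
    FsPermission perm = stat.getPermission();
    if (user.equals(stat.getOwner())) {
      return perm.getUserAction().implies(want);    // owner: only the owner bits count
    }
    for (String group : groups) {
      if (group.equals(stat.getGroup())) {
        return perm.getGroupAction().implies(want); // group member: only the group bits count
      }
    }
    return perm.getOtherAction().implies(want);     // everyone else: the "other" bits
  }
}
{code}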
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556138#action_12556138 ] Doug Cutting commented on HADOOP-2514: -- > Trashing will be more efficent I think it is premature to optimize this, especially if that involves complicating the namenode kernel. > We are able to treat delete as delete not rename and therefore perform the > right permission checking. I'm confused by this. Moving something to the trash is not deleting it, it's moving it. Don't we want folks to be able to move things out of the trash again? So the trash needs to be a directory where the user can write things, and that permission must be validated on move-to-trash. We might also check some other things, like whether the user has the right to delete those files, but that's just to keep folks from being surprised later if their trash isn't actually deleted. Someone could still chmod something in the trash and get into the same situation. To truly prevent that we'd need to make the trash into some sort of special purgatory directory with behavior like no other, no? > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HADOOP-2531) HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false
HDFS FileStatus.getPermission() should return 777 when dfs.permissions=false Key: HADOOP-2531 URL: https://issues.apache.org/jira/browse/HADOOP-2531 Project: Hadoop Issue Type: Bug Components: dfs Reporter: Doug Cutting Fix For: 0.16.0 Generic permission checking code should still work correctly when dfs.permissions=false. Currently FileStatus#getPermission() returns the actual permission when dfs.permissions=false on the namenode, which is incorrect, since all accesses are permitted in this case. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556130#action_12556130 ] Doug Cutting commented on HADOOP-2528: -- > Whether a file is readable/writable also depends on if the user has > searchable permission on all ancestor directories Isn't that already demonstrated by the fact that the file is returned from listStatus()? > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2514) Trash and permissions don't mix
[ https://issues.apache.org/jira/browse/HADOOP-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556129#action_12556129 ] Doug Cutting commented on HADOOP-2514: -- > if the home directory does not exist, I am proposing that deletes move to a > common trash area. Or the move-to-trash could fail with an exception in this case. > Also note with the trashbin in /user//.trash, instead of > /trash/ the trashbin compacter will have to look in multiple home > dirs instead of merely in /trash. Why is that bad? It'll have to look in the same number of directories in either case, no? > Unfortunately the client side code may find it expensive to do a rpc > per-subtree-entry when deleting a large subtree. It's only an RPC per directory in the tree, not per file. > Are you suggesting a per-user trashbin compacter running as the user? No. But we might have the emptier thread 'su' to each user as it loops through the trash directories so that the checking is implicit and only performed once. I don't like using 'su'-like stuff much though. > Trash and permissions don't mix > --- > > Key: HADOOP-2514 > URL: https://issues.apache.org/jira/browse/HADOOP-2514 > Project: Hadoop > Issue Type: New Feature > Components: dfs >Affects Versions: 0.16.0 >Reporter: Robert Chansler > Fix For: 0.16.0 > > > Shell command "rm" is really "mv" to trash with the expectation that the > server will at some point really delete the contents of trash. With the > advent of permissions, a user can "mv" folders that the user cannot "rm". The > present trash feature as implemented would allow the user to suborn the > server into deleting a folder in violation of the permissions model. > A related issue is that if anybody can mv a folder to the trash anybody else > can mv that same folder from the trash. This may be contrary to the > expectations of the user. > What is a better model for trash? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556123#action_12556123 ] Doug Cutting commented on HADOOP-2528: -- Let me be clear: permission checking of mapred inputs may not work very well yet. But it should work in the 0.16 release. It looks like when dfs.permissions=false that the returned file permissions are not all 777. That's perhaps a bug. Either that, or the FileSystem#checkAccess() utility method added by this patch should somehow check whether permissions are enabled. It is better to begin to address such issues sooner than later in this release cycle. If we advertise that file permissions are implemented in this release, then we ought to attempt to make sure that they're usable, no? Checking permissions while checking existence of inputs seems like a reasonable thing to be able to do, should have no new significant performance impact, and causes us to work some of these things out. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HADOOP-2528) check permissions for job inputs and outputs
[ https://issues.apache.org/jira/browse/HADOOP-2528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556119#action_12556119 ] Doug Cutting commented on HADOOP-2528: -- > What if permissions checking is disabled (trunk currently allows it. it is in > fact the default)? Yes, it is possible to disable HDFS permission checks. But shouldn't generic permission checking code still work? We don't want every bit of code that uses filesystem permissions to first have to check if permission checking is enabled. Rather, generic permission checking code should be a no-op when permission checking is disabled in a particular filesystem implementation. > This looks similar to how DFS used to invoke 'exits(file)' before opening a > file. Again, this patch causes no new HDFS RPC calls to be made. It just checks the new values now returned. You might argue that we should disable all input and output checks, but that should be done in a separate issue. Input and output checking were added since folks preferred to find out sooner when their jobs were destined to fail. Perhaps with splits generated client-side now input checking is less critical. But checking the output directory is probably still of great value. > I don't think client alone can decide if a particular access is allowed. The value of FileStatus.getPermission() is never null. It should either be "777" or the correct value for filesystems that implement permission checking. > check permissions for job inputs and outputs > > > Key: HADOOP-2528 > URL: https://issues.apache.org/jira/browse/HADOOP-2528 > Project: Hadoop > Issue Type: Improvement > Components: mapred >Reporter: Doug Cutting > Fix For: 0.16.0 > > Attachments: HADOOP-2528-0.patch > > > On job submission, filesystem permissions should be checked to ensure that > the input directory is readable and that the output directory is writable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.