Enhance streaming to use the new caching feature
------------------------------------------------
                 Key: HADOOP-576
                 URL: http://issues.apache.org/jira/browse/HADOOP-576
             Project: Hadoop
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: Michel Tourn

Design proposal to expose filecache access to Hadoop streaming.

The main differences from the pure-Java filecache code are:
1. As part of job launch (in the hadoopStreaming client) we validate the presence of the cached archives/files in DFS.
2. As part of Task initialization, a symbolic link to the cached files/unarchived directories is created in the Task working directory.

C1. New command-line options (example)

  -cachearchive dfs:/user/me/big.zip#big_1
  -cachefile dfs:/user/other/big.zip#big_2
  -cachearchive dfs:/user/me/bang.zip

This maps to API calls to the static methods:

  DistributedCache.addCacheArchive(URI uri, Configuration conf)
  DistributedCache.addCacheFile(URI uri, Configuration conf)

This is done in class StreamJob, methods parseArgv() and setJobConf(). The code should be similar to the way "-file" is handled. One difference is that we now require a FileSystem instance to VALIDATE the DFS paths given to -cachefile and -cachearchive. The FileSystem instance should not be accessed before the filesystem is set by this call:

  setUserJobConfProps(true);

If the FileSystem instance is "local" and there are -cachearchive/-cachefile options, then fail: this is not supported. Otherwise fs_.isFile(Path) should return true for each -cachearchive/-cachefile option. Only in verbose mode: show the isFile status of each option. In any verbosity mode: show the first failed isFile() status and abort using method StreamJob.fail().

C2. Task initialization

The symlinks are called:

  Workingdir/big_1    (points to directory /cache/user/me/big_zip)
  Workingdir/big_2    (points to file      /cache/user/other/big.zip)
  Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)

This will require hadoopStreaming to create symbolic links. Hadoop should have code to do this in a portable way, although it may not be supported on non-Unix platforms: cross-platform support is harder than for hard links, and Cygwin soft links are not a solution (they only work for applications compiled against cygwin1.dll). Symbolic links also make JUnit tests less portable, so the test should probably run as part of an ant target test-unix (in contrib/streaming/build.xml). A rough sketch of this step appears at the end of this message.

The parameters after -cachearchive and -cachefile have the following properties:
A. you can optionally give a name to your symlink (after #)
B. the default name is the leaf name (big.zip, big.zip, bang.zip)
C. if the same leaf name appears more than once you MUST give a name; otherwise the streaming client aborts and complains.

For example, the streaming client should complain about this:

  -cachearchive dfs:/user/me/big.zip
  -cachefile dfs:/user/other/big.zip

It complains because the two occurrences of "big.zip" are not disambiguated with #big_1 and #big_2. Ideally the error message should also generate an example of how to fix the parameters:

  -cachearchive dfs:/user/me/big.zip#1
  -cachefile dfs:/user/other/big.zip#2
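Putting C1 and the naming rules together, the option handling in StreamJob might look roughly like the sketch below. This is only an illustration: the class, field and helper names (CacheOptionSketch, addCacheOption, linkNames_, verbose_, fail(), the LocalFileSystem check) are made up here; only the DistributedCache.addCacheArchive/addCacheFile signatures and the fs_.isFile() validation come from the proposal above.

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of -cachearchive / -cachefile handling for StreamJob.
 * Names and structure are hypothetical; only the DistributedCache
 * calls and the isFile() validation come from the proposal.
 */
public class CacheOptionSketch {

  private final Configuration conf_;
  private final FileSystem fs_;
  private final boolean verbose_;
  // resolved symlink name -> original argument, used to detect name clashes
  private final Map<String, String> linkNames_ = new HashMap<String, String>();

  public CacheOptionSketch(Configuration conf, boolean verbose) throws Exception {
    conf_ = conf;
    // must only be called after setUserJobConfProps(true) has set the filesystem
    fs_ = FileSystem.get(conf);
    verbose_ = verbose;
  }

  /** Called once per -cachearchive or -cachefile argument. */
  public void addCacheOption(String arg, boolean isArchive) throws Exception {
    URI uri = new URI(arg);  // the argument must stay parsable by java.net.URI

    // The symlink name is the #fragment if given, else the leaf name.
    String name = uri.getFragment();
    if (name == null) {
      String path = uri.getPath();
      name = path.substring(path.lastIndexOf('/') + 1);
    }
    String clash = linkNames_.put(name, arg);
    if (clash != null) {
      fail("Duplicate cache name '" + name + "' for " + clash + " and " + arg
           + ". Disambiguate with a fragment, e.g. " + clash + "#1 and " + arg + "#2");
    }

    // The local filesystem is not supported for cached files/archives.
    if (fs_ instanceof LocalFileSystem) {
      fail("-cachearchive/-cachefile require DFS, not the local filesystem");
    }

    // Validate presence in DFS at job-launch time.
    boolean ok = fs_.isFile(new Path(uri.getPath()));
    if (verbose_) {
      System.err.println("isFile(" + uri.getPath() + ") = " + ok);
    }
    if (!ok) {
      fail("Not a DFS file: " + arg);  // shown in any verbosity mode, then abort
    }

    if (isArchive) {
      DistributedCache.addCacheArchive(uri, conf_);
    } else {
      DistributedCache.addCacheFile(uri, conf_);
    }
  }

  private void fail(String msg) {
    throw new IllegalArgumentException(msg);  // StreamJob.fail() in the real code
  }
}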
---------

hadoop-Client note: currently argv parsing is position-independent, i.e. changing the order of arguments never impacts the behaviour of hadoopStreaming. It would be good to keep this behaviour.

URI notes: the scheme is "dfs:" for consistency with the current state of the Hadoop code, although there is a proposal to change the scheme to "hdfs:". Using a URI fragment to give a local name to the resource is unusual. The main constraint is that the URI should remain parsable by java.net.URI(String). And encoding attributes in the fragment is standard, like CGI parameters in an HTTP GET request (the fragment is #big_2 in dfs:/user/other/big.zip#big_2).
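For the C2 symlink step, a rough Unix-only sketch follows. Since the JDK offers no portable symlink API, it shells out to ln(1); the class and method names are hypothetical, and the /cache/... paths are just the example paths from above.

import java.io.File;
import java.io.IOException;

/**
 * Rough, Unix-only sketch of the C2 step: create Workingdir/<name> as a
 * symlink to the localized cache path. Names are hypothetical; the real
 * code would run during task initialization.
 */
public class CacheSymlinkSketch {

  /**
   * @param linkName   name chosen via the URI #fragment, or the leaf name
   * @param cachePath  localized path, e.g. /cache/user/me/big_zip
   * @param workingDir the task working directory
   */
  public static void createSymlink(String linkName, String cachePath, File workingDir)
      throws IOException, InterruptedException {
    File link = new File(workingDir, linkName);
    if (link.exists()) {
      return;  // already linked
    }
    // No portable symlink API, so shell out to ln(1). This is what makes the
    // feature Unix-only; Cygwin links would not help, since they are only
    // visible to programs built against cygwin1.dll.
    String[] cmd = { "ln", "-s", cachePath, link.getAbsolutePath() };
    Process p = Runtime.getRuntime().exec(cmd);
    int rc = p.waitFor();
    if (rc != 0) {
      throw new IOException("ln -s failed with exit code " + rc + " for " + linkName);
    }
  }

  // Example corresponding to the options above:
  //   createSymlink("big_1",    "/cache/user/me/big_zip",    workingDir);
  //   createSymlink("big_2",    "/cache/user/other/big.zip", workingDir);
  //   createSymlink("bang.zip", "/cache/user/me/bang_zip",   workingDir);
}

Because this shells out to ln(1), the corresponding JUnit test would only run under the test-unix ant target mentioned above.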