Re: Using addCacheArchive
To push the file to HDFS (put it in the 'a_hdfsDirectory' directory):

    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    Path srcPath = new Path(a_directory + "/" + outputName);
    Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
    hdfs.copyFromLocalFile(srcPath, dstPath);

To read it from HDFS in your mapper or reducer (open the stream through the FileSystem, not java.io.FileReader, since the path lives in HDFS):

    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    Path cachePath = new Path(a_hdfsDirectory + "/" + outputName);
    BufferedReader wordReader = new BufferedReader(
        new InputStreamReader(hdfs.open(cachePath)));

On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 wrote:
> Thanks Chris for your reply!
>
> Well, I could not understand much of what has been discussed on that
> forum. I am unaware of Cascading.
>
> My problem is simple - I want a directory to be present in the local
> working directory of tasks so that I can access it from my map task in
> the following manner:
>
> FileInputStream fin = new FileInputStream("Config/file1.config");
>
> where Config is a directory which contains many files/directories, one
> of which is file1.config.
>
> It would be helpful to me if you can tell me what statements to use to
> distribute a directory to the tasktrackers.
> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
> says that archives are unzipped on the tasktrackers, but I want an
> example of how to use this in the case of a directory.
>
> Thanks,
> Akhil
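Once the cached file has been localized (or symlinked) on a task node, it reads like any local file. A minimal stdlib sketch of such a read with try-with-resources so the reader is always closed, using a temp file as a stand-in for the localized cache file (all names here are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CachedFileReader {

    // Read all lines of a local file (what a task sees once the cached
    // file has been localized), closing the reader even on error.
    public static List<String> readLines(String path) {
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            List<String> lines = new ArrayList<>();
            for (String line; (line = r.readLine()) != null; ) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained demo against a temp file standing in for the cache file.
    public static boolean demo() {
        try {
            java.nio.file.Path tmp = java.nio.file.Files.createTempFile("word-list", ".txt");
            java.nio.file.Files.write(tmp, "alpha\nbeta".getBytes());
            return readLines(tmp.toString()).equals(Arrays.asList("alpha", "beta"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read ok: " + demo());
    }
}
```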
Re: Using addCacheArchive
Thanks Chris for your reply!

Well, I could not understand much of what has been discussed on that forum. I am unaware of Cascading.

My problem is simple - I want a directory to be present in the local working directory of tasks so that I can access it from my map task in the following manner:

    FileInputStream fin = new FileInputStream("Config/file1.config");

where Config is a directory which contains many files/directories, one of which is file1.config.

It would be helpful to me if you can tell me what statements to use to distribute a directory to the tasktrackers. The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says that archives are unzipped on the tasktrackers, but I want an example of how to use this in the case of a directory.

Thanks,
Akhil

Chris Curtin-2 wrote:
> Hi,
>
> I've found it much easier to write the file to HDFS using the API, then
> pass the 'path' to the file in HDFS as a property. You'll need to
> remember to clean up the file after you're done with it.
>
> Example details are in this thread:
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
>
> Hope this helps,
>
> Chris

--
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Using addCacheArchive
Hi,

I've found it much easier to write the file to HDFS using the API, then pass the 'path' to the file in HDFS as a property. You'll need to remember to clean up the file after you're done with it.

Example details are in this thread:
http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#

Hope this helps,

Chris

On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 wrote:
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
Re: Using addCacheArchive
Yes, my HDFS paths are of the form /home/user-name/, and I have used these successfully in DistributedCache's addCacheFiles method.

Thanks,
Akhil

Amareshwari Sriramadasu wrote:
> Is your hdfs path /home/akhil1988/Config.zip? Usually the hdfs path is of
> the form /user/akhil1988/Config.zip.
> Just wondering if you are giving the wrong path in the uri!
>
> Thanks
> Amareshwari

--
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24214730.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Using addCacheArchive
Is your hdfs path /home/akhil1988/Config.zip? Usually the hdfs path is of the form /user/akhil1988/Config.zip. Just wondering if you are giving the wrong path in the uri!

Thanks
Amareshwari

akhil1988 wrote:
> Thanks Amareshwari for your reply!
>
> The file Config.zip is lying in the HDFS; if it were not, the error
> would be reported by the jobtracker itself while executing the statement:
>
> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> conf);
>
> But I get the error in the map function when I try to access the Config
> directory.
>
> Now I am using the following statement but still getting the same error:
>
> DistributedCache.addCacheArchive(new
> URI("/home/akhil1988/Config.zip#Config"), conf);
>
> Do you think there could be any problem in distributing a zipped
> directory and having hadoop unzip it recursively?
>
> Thanks!
> Akhil
Re: Using addCacheArchive
Thanks Amareshwari for your reply!

The file Config.zip is lying in the HDFS; if it were not, the error would be reported by the jobtracker itself while executing the statement:

    DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);

But I get the error in the map function when I try to access the Config directory.

Now I am using the following statement but still getting the same error:

    DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem in distributing a zipped directory and having hadoop unzip it recursively?

Thanks!
Akhil

Amareshwari Sriramadasu wrote:
> Hi Akhil,
>
> DistributedCache.addCacheArchive takes a path on hdfs. From your code,
> it looks like you are passing a local path.
> Also, if you want to create a symlink, you should pass the URI as
> hdfs://<path>#<link-name>, besides calling
> DistributedCache.createSymlink(conf);
>
> Thanks
> Amareshwari

--
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24214657.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Using addCacheArchive
Hi Akhil,

DistributedCache.addCacheArchive takes a path on hdfs. From your code, it looks like you are passing a local path.
Also, if you want to create a symlink, you should pass the URI as hdfs://<path>#<link-name>, besides calling DistributedCache.createSymlink(conf);

Thanks
Amareshwari

akhil1988 wrote:
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
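The part after '#' in the archive URI is an ordinary URI fragment, which is how the framework learns the symlink name; java.net.URI shows the split directly. A small stdlib sketch (the namenode host and port below are invented for illustration):

```java
public class CacheUri {

    // The text after '#' in an addCacheArchive URI is a plain URI fragment;
    // java.net.URI separates it from the HDFS path.
    public static String symlinkName(String uri) {
        return java.net.URI.create(uri).getFragment();
    }

    public static String hdfsPath(String uri) {
        return java.net.URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String u = "hdfs://namenode:9000/user/akhil1988/Config.zip#Config";
        // Prints the HDFS path and the symlink name seen by the task.
        System.out.println(hdfsPath(u) + " -> symlink " + symlinkName(u));
    }
}
```

Note that a URI with no fragment yields null for the symlink name, which is consistent with no symlink being requested.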
Re: Using addCacheArchive
Please ask any questions if I am not clear above about the problem I am facing.

Thanks,
Akhil

--
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Using addCacheArchive
Hi All!

I want a directory to be present in the local working directory of the task, for which I am using the following statements:

    DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
    DistributedCache.createSymlink(conf);

Here Config is a directory which I have zipped and put at the given location in HDFS.

I have zipped the directory because the API doc of DistributedCache (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the archive files are unzipped in the local cache directory:

"DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes."

So, from my understanding of the API docs I expect that the Config.zip file will be unzipped to a Config directory, and since I have symlinked them I can access the directory in the following manner from my map function:

    FileInputStream fin = new FileInputStream("Config/file1.config");

But I get a FileNotFoundException on the execution of this statement. Please let me know where I am going wrong.

Thanks,
Akhil

--
View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24207739.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
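One thing worth checking in a setup like this: whether "Config/file1.config" resolves after un-archiving depends on the entry names inside Config.zip carrying the top-level "Config/" prefix (an archive built from inside the directory would unpack to bare "file1.config"). A stdlib sketch of zipping a directory so its entries keep that prefix; the paths and file names are hypothetical:

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DirZipper {

    // Zip 'dir' recursively so entry names keep the top-level directory,
    // e.g. "Config/file1.config" rather than "file1.config".
    public static void zipDirectory(Path dir, Path zipFile) throws IOException {
        Path base = dir.getParent();
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path f : files) {
                // '/' separators keep the archive portable across platforms.
                String entry = base.relativize(f).toString().replace(File.separatorChar, '/');
                out.putNextEntry(new ZipEntry(entry));
                Files.copy(f, out);
                out.closeEntry();
            }
        }
    }

    public static List<String> entryNames(Path zipFile) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(zipFile))) {
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        }
        return names;
    }

    // Self-contained demo: build Config/file1.config in a temp dir, zip it,
    // and check the entry name the map task would depend on.
    public static boolean demo() {
        try {
            Path tmp = Files.createTempDirectory("cache-demo");
            Files.createDirectories(tmp.resolve("Config"));
            Files.write(tmp.resolve("Config").resolve("file1.config"), "key=value".getBytes());
            Path zip = tmp.resolve("Config.zip");
            zipDirectory(tmp.resolve("Config"), zip);
            return entryNames(zip).contains("Config/file1.config");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Config/file1.config present: " + demo());
    }
}
```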