The code below works fine in my 10-node EC2 cluster, with the 'shared' file created dynamically by a previous map/reduce job in the same flow.
I also push the file programmatically, not using the hadoop fs command, so I don't know if file paths have anything to do with it. Maybe try it without the 'hdfs:' prefix? I don't have that anywhere.

Chris

On Thu, Jul 2, 2009 at 12:23 AM, akhil1988 <akhilan...@gmail.com> wrote:
>
> Hi Chris!
>
> Sorry for the late reply!
>
> Pushing the file into HDFS is clear to me; it can also be done using the
> "hadoop fs -put" command (prior to executing the job), which I generally
> use.
>
> The method to access a file in HDFS from a Mapper/Reducer is the
> following (note that FileSystem.open takes a Path, not a String):
>
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open(new Path("/home/akhil1988/sample.txt"));
>
> The method (below) that you gave does not work:
>
> Path cachePath = new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(new
>     FileReader(cachePath.toString()));
>
> A file in HDFS cannot be accessed through these standard Java functions;
> it has to be accessed via the FileSystem method mentioned above. The API
> methods of the FileSystem class are very limited: they only let us read a
> data file (containing Java primitives), not arbitrary binary files.
>
> In my specific problem, I am using an API (specific to my research
> domain) which takes a path (String) as input and reads data from this
> path (which points to a binary file). So I just need a way to access
> files (from the tasktrackers) as we do via standard Java functions. For
> this, we need the files to be present in the local filesystem of the
> tasktrackers. That is why I am using DistributedCache.
>
> I hope I am clear? If I am wrong anywhere, please let me know.
>
> Thanks,
> Akhil
>
> Well, what I wanted was to have a directory in the local filesystem of
> the tasktracker, and not in HDFS, for the reason above.
>
>
> Chris Curtin-2 wrote:
> >
> > To push the file to HDFS (put it in the 'a_hdfsDirectory' directory):
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path srcPath = new Path(a_directory + "/" + outputName);
> > Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
> > hdfs.copyFromLocalFile(srcPath, dstPath);
> >
> > To read it from HDFS in your mapper or reducer:
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path cachePath = new Path(a_hdfsDirectory + "/" + outputName);
> > BufferedReader wordReader = new BufferedReader(new
> >     FileReader(cachePath.toString()));
> >
> >
> > On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <akhilan...@gmail.com> wrote:
> >
> >> Thanks Chris for your reply!
> >>
> >> Well, I could not understand much of what has been discussed on that
> >> forum. I am unaware of Cascading.
> >>
> >> My problem is simple - I want a directory to be present in the local
> >> working directory of tasks so that I can access it from my map task in
> >> the following manner:
> >>
> >> FileInputStream fin = new FileInputStream("Config/file1.config");
> >>
> >> where Config is a directory which contains many files/directories, one
> >> of which is file1.config.
> >>
> >> It would be helpful if you could tell me what statements to use to
> >> distribute a directory to the tasktrackers.
> >> The API doc (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html)
> >> says that archives are unzipped on the tasktrackers, but I want an
> >> example of how to use this in the case of a directory.
> >>
> >> Thanks,
> >> Akhil
> >>
> >>
> >> Chris Curtin-2 wrote:
> >> >
> >> > Hi,
> >> >
> >> > I've found it much easier to write the file to HDFS using the API,
> >> > then pass the 'path' to the file in HDFS as a property. You'll need
> >> > to remember to clean up the file after you're done with it.
> >> >
> >> > Example details are in this thread:
> >> > http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> >> >
> >> > Hope this helps,
> >> >
> >> > Chris
> >> >
> >> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <akhilan...@gmail.com> wrote:
> >> >
> >> >> Please ask any questions if I am not clear above about the problem
> >> >> I am facing.
> >> >>
> >> >> Thanks,
> >> >> Akhil
> >> >>
> >> >> akhil1988 wrote:
> >> >> >
> >> >> > Hi All!
> >> >> >
> >> >> > I want a directory to be present in the local working directory
> >> >> > of the task, for which I am using the following statements:
> >> >> >
> >> >> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
> >> >> > DistributedCache.createSymlink(conf);
> >> >> >
> >> >> > Here Config is a directory which I have zipped and put at the
> >> >> > given location in HDFS.
> >> >> >
> >> >> > I have zipped the directory because the API doc of
> >> >> > DistributedCache
> >> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
> >> >> > that archive files are unzipped in the local cache directory:
> >> >> >
> >> >> > "DistributedCache can be used to distribute simple, read-only
> >> >> > data/text files and/or more complex types such as archives, jars
> >> >> > etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at
> >> >> > the slave nodes."
> >> >> >
> >> >> > So, from my understanding of the API docs, I expect that the
> >> >> > Config.zip file will be unzipped to a Config directory, and since
> >> >> > I have symlinked them I can access the directory in the following
> >> >> > manner from my map function:
> >> >> >
> >> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >> >> >
> >> >> > But I get a FileNotFoundException on the execution of this
> >> >> > statement. Please let me know where I am going wrong.
> >> >> >
> >> >> > Thanks,
> >> >> > Akhil
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> >> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
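[Editor's note] A likely cause of the FileNotFoundException in the original question: Hadoop names the symlink for a cached archive after the fragment of the cache URI, and when no fragment is given the link takes the archive's own file name, so the unzipped contents end up under "Config.zip/..." rather than "Config/...". The sketch below needs no Hadoop on the classpath; `symlinkName` is a hypothetical helper written here only to mirror that documented fragment convention:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheSymlinkName {
    // The symlink created in the task's working directory is named after
    // the URI fragment; with no fragment, it is named after the file
    // itself, so "Config/file1.config" would not exist.
    static String symlinkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        // No fragment: the link is named after the archive file.
        System.out.println(symlinkName(new URI("/home/akhil1988/Config.zip")));
        // With a fragment: the link gets the fragment's name.
        System.out.println(symlinkName(new URI("/home/akhil1988/Config.zip#Config")));
    }
}
```

Under this assumption, either open "Config.zip/file1.config" from the map task, or register the archive as `DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf)` so the symlink is named Config.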
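[Editor's note] Akhil's point that HDFS files cannot be opened with plain java.io can be checked without a cluster: FileReader always resolves its argument against the local filesystem, so an HDFS path, with or without the 'hdfs:' prefix, is just the name of a (most likely non-existent) local file. A stdlib-only sketch; the sample path is the hypothetical one from the thread and is assumed not to exist locally:

```java
import java.io.FileReader;

public class JavaIoIsLocalOnly {
    public static void main(String[] args) {
        // Both spellings are treated as plain local file names by java.io;
        // neither triggers any HDFS lookup.
        String[] paths = {
            "/home/akhil1988/sample.txt",
            "hdfs:///home/akhil1988/sample.txt",
        };
        for (String p : paths) {
            try {
                new FileReader(p).close();
                System.out.println("opened " + p);
            } catch (Exception e) {
                // Expected here: FileNotFoundException from the local FS
                System.out.println(e.getClass().getSimpleName() + ": " + p);
            }
        }
    }
}
```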
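[Editor's note] Chris's push-then-read pattern can be sketched against the local filesystem, with java.nio.file standing in for the HDFS calls: Files.copy plays the role of hdfs.copyFromLocalFile, and Files.readString the role of opening the file in the mapper. All directory and file names below are made up for the demo:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PushThenRead {
    public static void main(String[] args) throws IOException {
        Path aDirectory = Files.createTempDirectory("local");      // local job output
        Path aHdfsDirectory = Files.createTempDirectory("shared"); // stand-in for HDFS
        String outputName = "sample.txt";

        // A file produced locally, e.g. by a previous job.
        Path srcPath = aDirectory.resolve(outputName);
        Files.writeString(srcPath, "hello from the previous job");

        // "Push" it to the shared location (analogue of copyFromLocalFile).
        Path dstPath = aHdfsDirectory.resolve(outputName);
        Files.copy(srcPath, dstPath, StandardCopyOption.REPLACE_EXISTING);

        // "Read" it back as the mapper/reducer would.
        String contents = Files.readString(dstPath);
        System.out.println(contents); // prints: hello from the previous job
    }
}
```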