The code below works fine in my 10-node EC2 cluster, with the 'shared' file created dynamically by a previous map/reduce job in the same flow.
I also push the file programmatically, not using the hadoop fs command, so I don't know if file paths have anything to do with it. Maybe try it without the 'hdfs:' prefix? I don't have that anywhere.

Chris

On Thu, Jul 2, 2009 at 12:23 AM, akhil1988 <akhilan...@gmail.com> wrote:
>
> Hi Chris!
>
> Sorry for the late reply!
>
> Pushing the file into HDFS is clear to me; it can also be done using the
> "hadoop fs -put" command (prior to executing the job), which I generally
> use.
>
> The method to access a file in HDFS from a Mapper/Reducer is the
> following (note that FileSystem.open takes a Path, not a String):
>
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open(new Path("/home/akhil1988/sample.txt"));
>
> The method (below) that you gave does not work:
>
> Path cachePath = new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(new
>     FileReader(cachePath.toString()));
>
> A file in HDFS cannot be accessed through these standard Java functions;
> it has to be accessed via the FileSystem method mentioned above. The API
> methods of the FileSystem class are very limited: they only let us read a
> data file (containing Java primitives), not arbitrary binary files.
>
> In my specific problem, I am using an API (specific to my research
> domain) which takes a path (String) as input and reads data from this
> path (which points to a binary file). So I just need a way to access
> files (from the tasktrackers) as we do via standard Java functions. For
> this, we need the files to be present in the local filesystem of the
> tasktrackers. That is why I am using DistributedCache.
>
> I hope I am clear? If I am wrong anywhere, please let me know.
>
> Thanks,
> Akhil
>
> Well, what I wanted was to have a directory in the local filesystem of
> the tasktracker, and not in HDFS, for the reason above.
>
>
> Chris Curtin-2 wrote:
> >
> > To push the file to HDFS (put it in the 'a_hdfsDirectory' directory):
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path srcPath = new Path(a_directory + "/" + outputName);
> > Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
> > hdfs.copyFromLocalFile(srcPath, dstPath);
> >
> > To read it from HDFS in your mapper or reducer:
> >
> > Configuration config = new Configuration();
> > FileSystem hdfs = FileSystem.get(config);
> > Path cachePath = new Path(a_hdfsDirectory + "/" + outputName);
> > BufferedReader wordReader = new BufferedReader(new
> >     FileReader(cachePath.toString()));
> >
> >
> > On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <akhilan...@gmail.com> wrote:
> >
> >> Thanks Chris for your reply!
> >>
> >> Well, I could not understand much of what has been discussed on that
> >> forum. I am unaware of Cascading.
> >>
> >> My problem is simple - I want a directory to be present in the local
> >> working directory of tasks so that I can access it from my map task in
> >> the following manner:
> >>
> >> FileInputStream fin = new FileInputStream("Config/file1.config");
> >>
> >> where Config is a directory which contains many files/directories, one
> >> of which is file1.config.
> >>
> >> It would be helpful if you could tell me what statements to use to
> >> distribute a directory to the tasktrackers.
> >> The API doc (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html)
> >> says that archives are unzipped on the tasktrackers, but I want an
> >> example of how to use this in the case of a directory.
> >>
> >> Thanks,
> >> Akhil
> >>
> >>
> >> Chris Curtin-2 wrote:
> >> >
> >> > Hi,
> >> >
> >> > I've found it much easier to write the file to HDFS using the API,
> >> > then pass the 'path' to the file in HDFS as a property. You'll need
> >> > to remember to clean up the file after you're done with it.
> >> >
> >> > Example details are in this thread:
> >> > http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> >> >
> >> > Hope this helps,
> >> >
> >> > Chris
> >> >
> >> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <akhilan...@gmail.com> wrote:
> >> >
> >> >> Please ask any questions if I am not clear above about the problem
> >> >> I am facing.
> >> >>
> >> >> Thanks,
> >> >> Akhil
> >> >>
> >> >> akhil1988 wrote:
> >> >> >
> >> >> > Hi All!
> >> >> >
> >> >> > I want a directory to be present in the local working directory
> >> >> > of the task, for which I am using the following statements:
> >> >> >
> >> >> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
> >> >> > DistributedCache.createSymlink(conf);
> >> >> >
> >> >> > Here Config is a directory which I have zipped and put at the
> >> >> > given location in HDFS.
> >> >> >
> >> >> > I have zipped the directory because the API doc of
> >> >> > DistributedCache
> >> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
> >> >> > that archive files are unzipped in the local cache directory:
> >> >> >
> >> >> > "DistributedCache can be used to distribute simple, read-only
> >> >> > data/text files and/or more complex types such as archives, jars
> >> >> > etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at
> >> >> > the slave nodes."
> >> >> >
> >> >> > So, from my understanding of the API docs, I expect that the
> >> >> > Config.zip file will be unzipped to a Config directory, and since
> >> >> > I have symlinked them I can access the directory in the following
> >> >> > manner from my map function:
> >> >> >
> >> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >> >> >
> >> >> > But I get a FileNotFoundException on the execution of this
> >> >> > statement. Please let me know where I am going wrong.
> >> >> >
> >> >> > Thanks,
> >> >> > Akhil
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> >> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
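[Editor's note] A likely cause of the FileNotFoundException in the original question: Hadoop names the symlink for a cached archive after the fragment of the cache URI, and when no fragment is given the link takes the archive's own file name, so the unzipped contents end up under "Config.zip/..." rather than "Config/...". The sketch below needs no Hadoop on the classpath; `symlinkName` is a hypothetical helper written here only to mirror that documented fragment convention:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CacheSymlinkName {
    // The symlink created in the task's working directory is named after
    // the URI fragment; with no fragment, it is named after the file
    // itself, so "Config/file1.config" would not exist.
    static String symlinkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws URISyntaxException {
        // No fragment: the link is named after the archive file.
        System.out.println(symlinkName(new URI("/home/akhil1988/Config.zip")));
        // With a fragment: the link gets the fragment's name.
        System.out.println(symlinkName(new URI("/home/akhil1988/Config.zip#Config")));
    }
}
```

Under this assumption, either open "Config.zip/file1.config" from the map task, or register the archive as `DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf)` so the symlink is named Config.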
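[Editor's note] Akhil's point that HDFS files cannot be opened with plain java.io can be checked without a cluster: FileReader always resolves its argument against the local filesystem, so an HDFS path, with or without the 'hdfs:' prefix, is just the name of a (most likely non-existent) local file. A stdlib-only sketch; the sample path is the hypothetical one from the thread and is assumed not to exist locally:

```java
import java.io.FileReader;

public class JavaIoIsLocalOnly {
    public static void main(String[] args) {
        // Both spellings are treated as plain local file names by java.io;
        // neither triggers any HDFS lookup.
        String[] paths = {
            "/home/akhil1988/sample.txt",
            "hdfs:///home/akhil1988/sample.txt",
        };
        for (String p : paths) {
            try {
                new FileReader(p).close();
                System.out.println("opened " + p);
            } catch (Exception e) {
                // Expected here: FileNotFoundException from the local FS
                System.out.println(e.getClass().getSimpleName() + ": " + p);
            }
        }
    }
}
```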
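[Editor's note] Chris's push-then-read pattern can be sketched against the local filesystem, with java.nio.file standing in for the HDFS calls: Files.copy plays the role of hdfs.copyFromLocalFile, and Files.readString the role of opening the file in the mapper. All directory and file names below are made up for the demo:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PushThenRead {
    public static void main(String[] args) throws IOException {
        Path aDirectory = Files.createTempDirectory("local");      // local job output
        Path aHdfsDirectory = Files.createTempDirectory("shared"); // stand-in for HDFS
        String outputName = "sample.txt";

        // A file produced locally, e.g. by a previous job.
        Path srcPath = aDirectory.resolve(outputName);
        Files.writeString(srcPath, "hello from the previous job");

        // "Push" it to the shared location (analogue of copyFromLocalFile).
        Path dstPath = aHdfsDirectory.resolve(outputName);
        Files.copy(srcPath, dstPath, StandardCopyOption.REPLACE_EXISTING);

        // "Read" it back as the mapper/reducer would.
        String contents = Files.readString(dstPath);
        System.out.println(contents); // prints: hello from the previous job
    }
}
```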