Akhil


akhil1988 wrote:
> 
> Hi Chris!
> 
> Sorry for the late reply!
> 
> Pushing the file into HDFS is clear to me; it can also be done with the
> "hadoop fs -put" command (prior to executing the job), which is what I
> generally use.
> 
> The method to access a file in HDFS from a Mapper/Reducer is the
> following:
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream din = fs.open(new Path("/home/akhil1988/sample.txt"));
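>
> For completeness, reading from that stream might look roughly like this
> (untested sketch):
>
> BufferedReader reader = new BufferedReader(new InputStreamReader(din));
> String line;
> while ((line = reader.readLine()) != null) {
>     // process one line of the file in HDFS
> }
> reader.close();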
> 
> The method (below) that you gave does not work:
> Path cachePath = new Path("hdfs:///home/akhil1988/sample.txt");
> BufferedReader wordReader = new BufferedReader(
>         new FileReader(cachePath.toString()));
> 
> A file in HDFS cannot be accessed through these standard Java functions;
> it has to be accessed via the method I have mentioned above. The API
> methods of the FileSystem class are very limited: they only let us read a
> data file (containing Java primitives), not arbitrary binary files.
> 
> In my specific problem, I am using an API (specific to my research
> domain) which takes a path (String) as input and reads data from this
> path (which points to a binary file). So I just need a way to access
> files from the tasktrackers as we do via standard Java I/O. For this, the
> files need to be present in the local filesystem of the tasktrackers.
> That is why I am using DistributedCache.
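>
> A sketch of what I mean (untested; "myDomainApi" and the file name are
> placeholders for my research API and its data):
>
> public void configure(JobConf conf) {
>     try {
>         // local, already-unzipped copies of the cached archives
>         Path[] localArchives = DistributedCache.getLocalCacheArchives(conf);
>         // hand a plain local path String to the domain API
>         myDomainApi.init(localArchives[0].toString() + "/data.bin");
>     } catch (IOException e) {
>         throw new RuntimeException(e);
>     }
> }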
> 
> I hope I am clear? If I am wrong anywhere, please let me know.
> 
> Thanks,
> Akhil
>
> Chris Curtin-2 wrote:
>> 
>> To push the file to HDFS (i.e., put it in the 'a_hdfsDirectory' directory):
>> 
>> Configuration config = new Configuration();
>> FileSystem hdfs = FileSystem.get(config);
>> Path srcPath = new Path(a_directory + "/" + outputName);
>> Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
>> hdfs.copyFromLocalFile(srcPath, dstPath);
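>> 
>> and to hand that path to the tasks, one option (the property name here
>> is just illustrative) is to put it in the job configuration:
>> 
>> config.set("my.hdfs.path", dstPath.toString());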
>> 
>> 
>> To read it from HDFS in your mapper or reducer:
>> 
>> Configuration config = new Configuration();
>> FileSystem hdfs = FileSystem.get(config);
>> Path cachePath= new Path(a_hdfsDirectory + "/" + outputName);
>> BufferedReader wordReader = new BufferedReader(
>>         new FileReader(cachePath.toString()));
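>> 
>> If FileReader cannot resolve HDFS paths in your setup, an untested
>> alternative is to read through the FileSystem handle instead:
>> 
>> BufferedReader wordReader = new BufferedReader(
>>         new InputStreamReader(hdfs.open(cachePath)));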
>> 
>> 
>> 
>> On Fri, Jun 26, 2009 at 8:55 PM, akhil1988 <akhilan...@gmail.com> wrote:
>> 
>>>
>>> Thanks Chris for your reply!
>>>
>>> Well, I could not understand much of what was discussed in that thread;
>>> I am not familiar with Cascading.
>>>
>>> My problem is simple - I want a directory to be present in the local
>>> working directory of tasks so that I can access it from my map task in
>>> the following manner:
>>>
>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>
>>> where Config is a directory which contains many files/directories, one
>>> of which is file1.config.
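>>>
>>> As a sanity check, something like this (untested sketch) can list what
>>> actually lands in the task's working directory:
>>>
>>> for (File f : new File(".").listFiles()) {
>>>     System.err.println(f.getPath()); // appears in the task's stderr log
>>> }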
>>>
>>> It would be helpful if you could tell me what statements to use to
>>> distribute a directory to the tasktrackers.
>>> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html
>>> says that archives are unzipped on the tasktrackers, but I want an
>>> example of how to use this in the case of a directory.
>>>
>>> Thanks,
>>> Akhil
>>>
>>>
>>>
>>> Chris Curtin-2 wrote:
>>> >
>>> > Hi,
>>> >
>>> > I've found it much easier to write the file to HDFS using the API,
>>> > then pass the path to the file in HDFS as a property. You'll need to
>>> > remember to clean up the file after you're done with it.
>>> >
>>> > Example details are in this thread:
>>> > http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
>>> >
>>> > Hope this helps,
>>> >
>>> > Chris
>>> >
>>> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 <akhilan...@gmail.com>
>>> > wrote:
>>> >
>>> >>
>>> >> Please ask any questions if I have not been clear above about the
>>> >> problem I am facing.
>>> >>
>>> >> Thanks,
>>> >> Akhil
>>> >>
>>> >> akhil1988 wrote:
>>> >> >
>>> >> > Hi All!
>>> >> >
>>> >> > I want a directory to be present in the local working directory of
>>> >> > the task, for which I am using the following statements:
>>> >> >
>>> >> > DistributedCache.addCacheArchive(
>>> >> >     new URI("/home/akhil1988/Config.zip"), conf);
>>> >> > DistributedCache.createSymlink(conf);
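>>> >> >
>>> >> > (One likely fix, untested: the DistributedCache uses the URI
>>> >> > fragment as the symlink name, so naming it explicitly should make
>>> >> > a "Config" link appear in the working directory:)
>>> >> >
>>> >> > DistributedCache.addCacheArchive(
>>> >> >     new URI("/home/akhil1988/Config.zip#Config"), conf);
>>> >> > DistributedCache.createSymlink(conf);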
>>> >> >
>>> >> >>> Here Config is a directory which I have zipped and put at the
>>> >> >>> given location in HDFS.
>>> >> >
>>> >> > I have zipped the directory because the API doc of DistributedCache
>>> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says
>>> >> > that the archive files are unzipped in the local cache directory:
>>> >> >
>>> >> > DistributedCache can be used to distribute simple, read-only
>>> >> > data/text files and/or more complex types such as archives, jars
>>> >> > etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at
>>> >> > the slave nodes.
>>> >> >
>>> >> > So, from my understanding of the API docs, I expect that the
>>> >> > Config.zip file will be unzipped to a Config directory, and since I
>>> >> > have symlinked them I can access the directory in the following
>>> >> > manner from my map function:
>>> >> >
>>> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
>>> >> >
>>> >> > But I get a FileNotFoundException on the execution of this
>>> >> > statement. Please let me know where I am going wrong.
>>> >> >
>>> >> > Thanks,
>>> >> > Akhil
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>> 
>> 
> 
> 
