Re: Using addCacheArchive

2009-06-29 Thread Chris Curtin
To push the file to HDFS (put it in the 'a_hdfsDirectory' directory):

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
// Local source file and its destination path in HDFS.
Path srcPath = new Path(a_directory + "/" + outputName);
Path dstPath = new Path(a_hdfsDirectory + "/" + outputName);
hdfs.copyFromLocalFile(srcPath, dstPath);


To read it from HDFS in your mapper or reducer:

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path cachePath = new Path(a_hdfsDirectory + "/" + outputName);
// Open via the HDFS FileSystem; a plain java.io.FileReader would look for
// the path on the task's local disk and fail.
BufferedReader wordReader = new BufferedReader(
    new InputStreamReader(hdfs.open(cachePath)));
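
Chris's suggestion (quoted below) is to pass the HDFS path to the tasks as a
job property. A minimal sketch of that glue, assuming the 0.20-era mapred API
used in this thread; the property name "my.config.path" and the surrounding
variable names are made up for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Job-submission side: record where the file lives in HDFS.
JobConf conf = new JobConf();
conf.set("my.config.path", a_hdfsDirectory + "/" + outputName);

// Mapper side, e.g. in configure(JobConf job): read the property back and
// open the file through the HDFS FileSystem, not java.io.FileReader.
String configPath = job.get("my.config.path");
FileSystem fs = FileSystem.get(job);
BufferedReader reader = new BufferedReader(
    new InputStreamReader(fs.open(new Path(configPath))));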



On Fri, Jun 26, 2009 at 8:55 PM, akhil1988  wrote:

>
> Thanks Chris for your reply!
>
> Well, I could not understand much of what has been discussed on that forum.
> I am not familiar with Cascading.
>
> My problem is simple - I want a directory to be present in the local
> working directory of tasks so that I can access it from my map task in
> the following manner:
>
> FileInputStream fin = new FileInputStream("Config/file1.config");
>
> where Config is a directory which contains many files/directories, one of
> which is file1.config.
>
> It would be helpful if you could tell me what statements to use to
> distribute a directory to the tasktrackers.
> The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
> that archives are unzipped on the tasktrackers, but I want an example of
> how to use this in the case of a directory.
>
> Thanks,
> Akhil
>
>
>
> Chris Curtin-2 wrote:
> >
> > Hi,
> >
> > I've found it much easier to write the file to HDFS using the API, then
> > pass the 'path' to the file in HDFS as a property. You'll need to
> > remember to clean up the file after you're done with it.
> >
> > Example details are in this thread:
> >
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> >
> > Hope this helps,
> >
> > Chris
> >
> > On Thu, Jun 25, 2009 at 4:50 PM, akhil1988  wrote:
> >
> >>
> >> Please ask any questions if I am not clear above about the problem I am
> >> facing.
> >>
> >> Thanks,
> >> Akhil
> >>
> >> akhil1988 wrote:
> >> >
> >> > Hi All!
> >> >
> >> > I want a directory to be present in the local working directory of the
> >> > task for which I am using the following statements:
> >> >
> >> > DistributedCache.addCacheArchive(new
> >> > URI("/home/akhil1988/Config.zip"),
> >> > conf);
> >> > DistributedCache.createSymlink(conf);
> >> >
> >> >>> Here Config is a directory which I have zipped and put at the given
> >> >>> location in HDFS
> >> >
> >> > I have zipped the directory because the API doc of DistributedCache
> >> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
> >> > the
> >> > archive files are unzipped in the local cache directory:
> >> >
> >> > DistributedCache can be used to distribute simple, read-only data/text
> >> > files and/or more complex types such as archives, jars etc. Archives
> >> > (zip,
> >> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
> >> >
> >> > So, from my understanding of the API docs I expect that the Config.zip
> >> > file will be unzipped to Config directory and since I have SymLinked
> >> > them
> >> > I can access the directory in the following manner from my map
> >> > function:
> >> >
> >> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >> >
> >> > But I get the FileNotFoundException on the execution of this
> >> > statement.
> >> > Please let me know where I am going wrong.
> >> >
> >> > Thanks,
> >> > Akhil
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Using addCacheArchive

2009-06-26 Thread akhil1988

Thanks Chris for your reply!

Well, I could not understand much of what has been discussed on that forum.
I am not familiar with Cascading.

My problem is simple - I want a directory to be present in the local working
directory of tasks so that I can access it from my map task in the following
manner:

FileInputStream fin = new FileInputStream("Config/file1.config"); 

where Config is a directory which contains many files/directories, one of
which is file1.config.

It would be helpful if you could tell me what statements to use to
distribute a directory to the tasktrackers.
The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
that archives are unzipped on the tasktrackers, but I want an example of
how to use this in the case of a directory.

Thanks,
Akhil
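
For what it's worth, stitching together the advice that follows in this
thread, the statements would look roughly like this. An untested sketch: the
namenode host/port and the /user/... path are placeholders, and it is the
#Config fragment plus createSymlink that make the unpacked archive show up as
./Config in each task's working directory:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);

// 1. Ship the zipped directory to HDFS (skip if it is already there).
hdfs.copyFromLocalFile(new Path("/home/akhil1988/Config.zip"),
                       new Path("/user/akhil1988/Config.zip"));

// 2. Register the archive. The #Config fragment names the symlink that
//    will point at the unzipped directory in the task's working directory.
DistributedCache.addCacheArchive(
    new URI("hdfs://namenode:9000/user/akhil1988/Config.zip#Config"), conf);
DistributedCache.createSymlink(conf);

// 3. In the map task, files are then reachable through the symlink, e.g.
//    FileInputStream fin = new FileInputStream("Config/file1.config");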



Chris Curtin-2 wrote:
> 
> Hi,
> 
> I've found it much easier to write the file to HDFS using the API, then pass
> the 'path' to the file in HDFS as a property. You'll need to remember to
> clean up the file after you're done with it.
> 
> Example details are in this thread:
> http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
> 
> Hope this helps,
> 
> Chris
> 
> On Thu, Jun 25, 2009 at 4:50 PM, akhil1988  wrote:
> 
>>
>> Please ask any questions if I am not clear above about the problem I am
>> facing.
>>
>> Thanks,
>> Akhil
>>
>> akhil1988 wrote:
>> >
>> > Hi All!
>> >
>> > I want a directory to be present in the local working directory of the
>> > task for which I am using the following statements:
>> >
>> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> > conf);
>> > DistributedCache.createSymlink(conf);
>> >
>> >>> Here Config is a directory which I have zipped and put at the given
>> >>> location in HDFS
>> >
>> > I have zipped the directory because the API doc of DistributedCache
>> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>> > the
>> > archive files are unzipped in the local cache directory:
>> >
>> > DistributedCache can be used to distribute simple, read-only data/text
>> > files and/or more complex types such as archives, jars etc. Archives
>> > (zip,
>> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
>> >
>> > So, from my understanding of the API docs I expect that the Config.zip
>> > file will be unzipped to Config directory and since I have SymLinked
>> > them
>> > I can access the directory in the following manner from my map
>> > function:
>> >
>> > FileInputStream fin = new FileInputStream("Config/file1.config");
>> >
>> > But I get the FileNotFoundException on the execution of this statement.
>> > Please let me know where I am going wrong.
>> >
>> > Thanks,
>> > Akhil
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-26 Thread Chris Curtin
Hi,

I've found it much easier to write the file to HDFS using the API, then pass
the 'path' to the file in HDFS as a property. You'll need to remember to
clean up the file after you're done with it.
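
The cleanup mentioned above is a single call once the job has finished; a
sketch reusing the names from the snippets at the top of this digest, where
dstPath is the HDFS file that was written:

// After the job completes, remove the temporary file from HDFS.
// The boolean is the 'recursive' flag; false is fine for a single file.
FileSystem hdfs = FileSystem.get(config);
hdfs.delete(dstPath, false);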

Example details are in this thread:
http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#

Hope this helps,

Chris

On Thu, Jun 25, 2009 at 4:50 PM, akhil1988  wrote:

>
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
>
> akhil1988 wrote:
> >
> > Hi All!
> >
> > I want a directory to be present in the local working directory of the
> > task for which I am using the following statements:
> >
> > DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> > conf);
> > DistributedCache.createSymlink(conf);
> >
> >>> Here Config is a directory which I have zipped and put at the given
> >>> location in HDFS
> >
> > I have zipped the directory because the API doc of DistributedCache
> > (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
> > the
> > archive files are unzipped in the local cache directory:
> >
> > DistributedCache can be used to distribute simple, read-only data/text
> > files and/or more complex types such as archives, jars etc. Archives
> > (zip,
> > tar and tgz/tar.gz files) are un-archived at the slave nodes.
> >
> > So, from my understanding of the API docs I expect that the Config.zip
> > file will be unzipped to Config directory and since I have SymLinked them
> > I can access the directory in the following manner from my map function:
> >
> > FileInputStream fin = new FileInputStream("Config/file1.config");
> >
> > But I get the FileNotFoundException on the execution of this statement.
> > Please let me know where I am going wrong.
> >
> > Thanks,
> > Akhil
> >
>
> --
> View this message in context:
> http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Yes, my HDFS paths are of the form /home/user-name/
And I have used these in DistributedCache's addCacheFile method
successfully.

Thanks,
Akhil



Amareshwari Sriramadasu wrote:
> 
> Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of
> the form /user/akhil1988/Config.zip.
> Just wondering if you are giving the wrong path in the URI!
> 
> Thanks
> Amareshwari
> 
> akhil1988 wrote:
>> Thanks Amareshwari for your reply!
>>
>> The file Config.zip is in HDFS; if it were not, the error would have been
>> reported by the jobtracker itself while executing the statement:
>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> conf);
>>
>> But I get the error in the map function when I try to access the Config
>> directory.
>>
>> Now I am using the following statement but still getting the same error: 
>> DistributedCache.addCacheArchive(new
>> URI("/home/akhil1988/Config.zip#Config"), conf);
>>
>> Do you think there could be any problem with distributing a zipped
>> directory and having Hadoop unzip it recursively?
>>
>> Thanks!
>> Akhil
>>
>>
>>
>> Amareshwari Sriramadasu wrote:
>>> Hi Akhil,
>>>
>>> DistributedCache.addCacheArchive takes a path on HDFS. From your code, it
>>> looks like you are passing a local path.
>>> Also, if you want to create a symlink, you should pass the URI as
>>> hdfs://<path>#<link-name>, besides calling
>>> DistributedCache.createSymlink(conf);
>>>
>>> Thanks
>>> Amareshwari
>>>
>>> akhil1988 wrote:
>>>> Please ask any questions if I am not clear above about the problem I am
>>>> facing.
>>>>
>>>> Thanks,
>>>> Akhil
>>>>
>>>> akhil1988 wrote:
>>>>> Hi All!
>>>>>
>>>>> I want a directory to be present in the local working directory of the
>>>>> task for which I am using the following statements:
>>>>>
>>>>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>>>>> conf);
>>>>> DistributedCache.createSymlink(conf);
>>>>>
>>>>>>> Here Config is a directory which I have zipped and put at the given
>>>>>>> location in HDFS
>>>>>
>>>>> I have zipped the directory because the API doc of DistributedCache
>>>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>>>> the archive files are unzipped in the local cache directory:
>>>>>
>>>>> DistributedCache can be used to distribute simple, read-only data/text
>>>>> files and/or more complex types such as archives, jars etc. Archives
>>>>> (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>>>
>>>>> So, from my understanding of the API docs I expect that the Config.zip
>>>>> file will be unzipped to Config directory and since I have SymLinked
>>>>> them I can access the directory in the following manner from my map
>>>>> function:
>>>>>
>>>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>>>
>>>>> But I get the FileNotFoundException on the execution of this statement.
>>>>> Please let me know where I am going wrong.
>>>>>
>>>>> Thanks,
>>>>> Akhil
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24214730.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-25 Thread Amareshwari Sriramadasu

Is your HDFS path /home/akhil1988/Config.zip? Usually an HDFS path is of the
form /user/akhil1988/Config.zip.
Just wondering if you are giving the wrong path in the URI!

Thanks
Amareshwari
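
A quick way to rule that out - a sketch, assuming the job's Configuration is
named conf and that the archive is expected under /user/akhil1988 - is to ask
the FileSystem whether the path resolves before registering it:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Fail fast at submission time if the archive is not where we think it is.
FileSystem hdfs = FileSystem.get(conf);
Path archive = new Path("/user/akhil1988/Config.zip");
if (!hdfs.exists(archive)) {
    throw new IOException("Archive not found in HDFS: " + archive);
}
DistributedCache.addCacheArchive(new URI(archive + "#Config"), conf);
DistributedCache.createSymlink(conf);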

akhil1988 wrote:
> Thanks Amareshwari for your reply!
>
> The file Config.zip is in HDFS; if it were not, the error would have been
> reported by the jobtracker itself while executing the statement:
> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> conf);
>
> But I get the error in the map function when I try to access the Config
> directory.
>
> Now I am using the following statement but still getting the same error:
> DistributedCache.addCacheArchive(new
> URI("/home/akhil1988/Config.zip#Config"), conf);
>
> Do you think there could be any problem with distributing a zipped
> directory and having Hadoop unzip it recursively?
>
> Thanks!
> Akhil
>
>
> Amareshwari Sriramadasu wrote:
>> Hi Akhil,
>>
>> DistributedCache.addCacheArchive takes a path on HDFS. From your code, it
>> looks like you are passing a local path.
>> Also, if you want to create a symlink, you should pass the URI as
>> hdfs://<path>#<link-name>, besides calling
>> DistributedCache.createSymlink(conf);
>>
>> Thanks
>> Amareshwari
>>
>> akhil1988 wrote:
>>> Please ask any questions if I am not clear above about the problem I am
>>> facing.
>>>
>>> Thanks,
>>> Akhil
>>>
>>> akhil1988 wrote:
>>>> Hi All!
>>>>
>>>> I want a directory to be present in the local working directory of the
>>>> task for which I am using the following statements:
>>>>
>>>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>>>> conf);
>>>> DistributedCache.createSymlink(conf);
>>>>
>>>>>> Here Config is a directory which I have zipped and put at the given
>>>>>> location in HDFS
>>>>
>>>> I have zipped the directory because the API doc of DistributedCache
>>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>>> the archive files are unzipped in the local cache directory:
>>>>
>>>> DistributedCache can be used to distribute simple, read-only data/text
>>>> files and/or more complex types such as archives, jars etc. Archives
>>>> (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>>
>>>> So, from my understanding of the API docs I expect that the Config.zip
>>>> file will be unzipped to Config directory and since I have SymLinked
>>>> them I can access the directory in the following manner from my map
>>>> function:
>>>>
>>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>>
>>>> But I get the FileNotFoundException on the execution of this statement.
>>>> Please let me know where I am going wrong.
>>>>
>>>> Thanks,
>>>> Akhil




Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Thanks Amareshwari for your reply!

The file Config.zip is in HDFS; if it were not, the error would have been
reported by the jobtracker itself while executing the statement:
DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);

But I get the error in the map function when I try to access the Config
directory.

Now I am using the following statement but still getting the same error: 
DistributedCache.addCacheArchive(new
URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem with distributing a zipped
directory and having Hadoop unzip it recursively?

Thanks!
Akhil
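
If the error persists, it can help to log what the framework actually
localized for the task. A debugging sketch, assuming it runs inside the
mapper's configure(JobConf job) method of the old mapred API:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

try {
    // Each entry is the local path an archive was unpacked to.
    Path[] localArchives = DistributedCache.getLocalCacheArchives(job);
    if (localArchives != null) {
        for (Path p : localArchives) {
            System.err.println("Localized archive at: " + p);
        }
    }
    // Symlinks, if requested, are created relative to this directory.
    System.err.println("Working dir: " + System.getProperty("user.dir"));
} catch (IOException e) {
    e.printStackTrace();
}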



Amareshwari Sriramadasu wrote:
> 
> Hi Akhil,
> 
> DistributedCache.addCacheArchive takes a path on HDFS. From your code, it
> looks like you are passing a local path.
> Also, if you want to create a symlink, you should pass the URI as
> hdfs://<path>#<link-name>, besides calling
> DistributedCache.createSymlink(conf);
> 
> Thanks
> Amareshwari
> 
> 
> akhil1988 wrote:
>> Please ask any questions if I am not clear above about the problem I am
>> facing.
>>
>> Thanks,
>> Akhil
>>
>> akhil1988 wrote:
>>   
>>> Hi All!
>>>
>>> I want a directory to be present in the local working directory of the
>>> task for which I am using the following statements: 
>>>
>>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>>> conf);
>>> DistributedCache.createSymlink(conf);
>>>
>>> 
>>>>> Here Config is a directory which I have zipped and put at the given
>>>>> location in HDFS
>>>>> 
>>> I have zipped the directory because the API doc of DistributedCache
>>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
>>> the
>>> archive files are unzipped in the local cache directory:
>>>
>>> DistributedCache can be used to distribute simple, read-only data/text
>>> files and/or more complex types such as archives, jars etc. Archives
>>> (zip,
>>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>>
>>> So, from my understanding of the API docs I expect that the Config.zip
>>> file will be unzipped to Config directory and since I have SymLinked
>>> them
>>> I can access the directory in the following manner from my map function:
>>>
>>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>>
>>> But I get the FileNotFoundException on the execution of this statement.
>>> Please let me know where I am going wrong.
>>>
>>> Thanks,
>>> Akhil
>>>
>>> 
>>
>>   
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24214657.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Using addCacheArchive

2009-06-25 Thread Amareshwari Sriramadasu

Hi Akhil,

DistributedCache.addCacheArchive takes a path on HDFS. From your code, it looks
like you are passing a local path.
Also, if you want to create a symlink, you should pass the URI as
hdfs://<path>#<link-name>, besides calling
DistributedCache.createSymlink(conf);


Thanks
Amareshwari
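
In code, that advice comes out as something like the following sketch, where
namenode:9000 stands in for the cluster's actual fs.default.name host and
port:

DistributedCache.addCacheArchive(
    new URI("hdfs://namenode:9000/user/akhil1988/Config.zip#Config"), conf);
DistributedCache.createSymlink(conf);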


akhil1988 wrote:
> Please ask any questions if I am not clear above about the problem I am
> facing.
>
> Thanks,
> Akhil
>
> akhil1988 wrote:
>> Hi All!
>>
>> I want a directory to be present in the local working directory of the
>> task for which I am using the following statements:
>>
>> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
>> conf);
>> DistributedCache.createSymlink(conf);
>>
>>>> Here Config is a directory which I have zipped and put at the given
>>>> location in HDFS
>>
>> I have zipped the directory because the API doc of DistributedCache
>> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
>> archive files are unzipped in the local cache directory:
>>
>> DistributedCache can be used to distribute simple, read-only data/text
>> files and/or more complex types such as archives, jars etc. Archives (zip,
>> tar and tgz/tar.gz files) are un-archived at the slave nodes.
>>
>> So, from my understanding of the API docs I expect that the Config.zip
>> file will be unzipped to Config directory and since I have SymLinked them
>> I can access the directory in the following manner from my map function:
>>
>> FileInputStream fin = new FileInputStream("Config/file1.config");
>>
>> But I get the FileNotFoundException on the execution of this statement.
>> Please let me know where I am going wrong.
>>
>> Thanks,
>> Akhil




Re: Using addCacheArchive

2009-06-25 Thread akhil1988

Please ask any questions if I am not clear above about the problem I am
facing.

Thanks,
Akhil

akhil1988 wrote:
> 
> Hi All!
> 
> I want a directory to be present in the local working directory of the
> task for which I am using the following statements: 
> 
> DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
> conf);
> DistributedCache.createSymlink(conf);
> 
>>> Here Config is a directory which I have zipped and put at the given
>>> location in HDFS
> 
> I have zipped the directory because the API doc of DistributedCache
> (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
> archive files are unzipped in the local cache directory:
> 
> DistributedCache can be used to distribute simple, read-only data/text
> files and/or more complex types such as archives, jars etc. Archives (zip,
> tar and tgz/tar.gz files) are un-archived at the slave nodes.
> 
> So, from my understanding of the API docs I expect that the Config.zip
> file will be unzipped to Config directory and since I have SymLinked them
> I can access the directory in the following manner from my map function:
> 
> FileInputStream fin = new FileInputStream("Config/file1.config");
> 
> But I get the FileNotFoundException on the execution of this statement.
> Please let me know where I am going wrong.
> 
> Thanks,
> Akhil
> 

-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Using addCacheArchive

2009-06-25 Thread akhil1988

Hi All!

I want a directory to be present in the local working directory of the task
for which I am using the following statements: 

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"),
conf);
DistributedCache.createSymlink(conf);

>> Here Config is a directory which I have zipped and put at the given
>> location in HDFS

I have zipped the directory because the API doc of DistributedCache
(http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that the
archive files are unzipped in the local cache directory:

DistributedCache can be used to distribute simple, read-only data/text files
and/or more complex types such as archives, jars etc. Archives (zip, tar and
tgz/tar.gz files) are un-archived at the slave nodes.

So, from my understanding of the API docs I expect that the Config.zip file
will be unzipped to Config directory and since I have SymLinked them I can
access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get the FileNotFoundException on the execution of this statement.
Please let me know where I am going wrong.

Thanks,
Akhil
-- 
View this message in context: 
http://www.nabble.com/Using-addCacheArchive-tp24207739p24207739.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.