Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2

2016-06-13 Thread Siddharth Dawar
Hi Arun,

Thanks for your prompt reply. Actually, I want to add cache files to the job
that is submitted internally by the

JobClient.runJob(conf2)

method.

I am unable to find a way to get the running job.

The method Job.getInstance(conf) creates a new job (but I want to add
files to the currently running job only).

On Tue, Jun 7, 2016 at 6:36 PM, Arun Natva <arun.na...@gmail.com> wrote:

> If you use an instance of the Job class, you can add files to the
> distributed cache like this:
> Job job = Job.getInstance(conf);
> job.addCacheFile(fileUri);
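>
> For example (the HDFS path below is just a placeholder) -- the file has to
> be added before the job is submitted; it cannot be added to a job that is
> already running:
>
> Job job = Job.getInstance(conf, "sid");
> job.addCacheFile(new URI("hdfs://localhost:9000/bloomfilter"));
> // ... set mapper/reducer classes and input/output paths as usual ...
> job.waitForCompletion(true);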
>
>
> Sent from my iPhone
>
> On Jun 7, 2016, at 5:17 AM, Siddharth Dawar <siddharthdawa...@gmail.com>
> wrote:
>
> Hi,
>
> I wrote a program which creates Map-Reduce jobs in an iterative fashion as
> follows:
>
>
> while (true) {
>
>     JobConf conf2 = new JobConf(getConf(), graphMining.class);
>     conf2.setJobName("sid");
>     conf2.setMapperClass(mapperMiner.class);
>     conf2.setReducerClass(reducerMiner.class);
>     conf2.setInputFormat(SequenceFileInputFormat.class);
>     conf2.setOutputFormat(SequenceFileOutputFormat.class);
>     conf2.setOutputValueClass(BytesWritable.class);
>     conf2.setMapOutputKeyClass(Text.class);
>     conf2.setMapOutputValueClass(MapWritable.class);
>     conf2.setOutputKeyClass(Text.class);
>     conf2.setNumMapTasks(Integer.parseInt(args[3]));
>     conf2.setNumReduceTasks(Integer.parseInt(args[4]));
>     FileInputFormat.addInputPath(conf2, new Path(input));
>     FileOutputFormat.setOutputPath(conf2, new Path(output));
>
>     RunningJob job = JobClient.runJob(conf2);
> }
>
>
> Now, I want the first job that gets created to write something to the
> distributed cache, and the jobs created after it to read from the
> distributed cache.
>
> I came to know that the DistributedCache.addCacheFile() method is
> deprecated, and the documentation suggests using the Job.addCacheFile() method
> specific to each job.
>
> But I am unable to get a handle on the currently running job, because
> JobClient.runJob(conf2) submits the job internally.
>
>
> How can I make the content written by the first job in this while loop
> available, via the distributed cache, to the jobs created in later
> iterations of the loop?
>
>


Re: Accessing files in Hadoop 2.7.2 Distributed Cache

2016-06-13 Thread Siddharth Dawar
Hi Jeff,

Thanks for your prompt reply. Actually, my problem is as follows:

My code creates a new job named "job 1", which writes something to the
distributed cache (say a text file), and the job completes.

Now, I want to create some number n of jobs in the while loop below, each of
which reads the text file written by "job 1" from the distributed cache. So my
question is: "*How to share content among multiple jobs using the distributed
cache*"?

*Another part of the problem* is that I don't know how to get an instance of
the running job from

JobClient.runJob(conf2);

*so that I can use the job.addCacheFile(..) method.*



while (true) {

    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);
    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setOutputValueClass(BytesWritable.class);
    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);
    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));

    RunningJob job = JobClient.runJob(conf2);
}




On Wed, Jun 8, 2016 at 3:50 AM, Guttadauro, Jeff <jeff.guttada...@here.com>
wrote:

> Hi, Siddharth.
>
>
>
> I was also a bit frustrated at what I found to be scant documentation on
> how to use the distributed cache in Hadoop 2.  The DistributedCache class
> itself was deprecated in Hadoop 2, but there don’t appear to be very clear
> instructions on the alternative.  I think it’s actually much simpler to
> work with files on the distributed cache in Hadoop 2.  The new way is to
> add files to the cache (or cacheArchive) via the Job object:
>
>
>
> job.addCacheFile(*uriForYourFile*)
>
> job.addCacheArchive(*uriForYourArchive*);
>
>
>
> The cool part is that, if you set up your URI so that it has a “#
> *yourFileReference*” at the end, then Hadoop will set up a symbolic link
> named “*yourFileReference*” in your job’s working directory, which you
> can use to get at the file or archive.  So, it’s as if the file or archive
> is in the working directory.  That obviates the need to even work with the
> DistributedCache class in your Mapper or Reducer, since you can just work
> with the file (or path using nio) directly.
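>
> A rough sketch of that pattern (the HDFS path and names below are just for
> illustration):
>
> // at submission time
> job.addCacheFile(new URI("hdfs://localhost:9000/bloomfilter#bloomfilter"));
>
> // in the mapper, the symlink "bloomfilter" sits in the working directory
> @Override
> protected void setup(Context context) throws IOException {
>     DataInputStream in = new DataInputStream(new FileInputStream("bloomfilter"));
>     filter.readFields(in);   // e.g. load a Bloom filter, as in your code
>     in.close();
> }
>
> Each job in your loop can add the same HDFS file this way and read it by the
> same symlink name.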
>
>
>
> Hope that helps.
>
> -Jeff
>
> *From:* Siddharth Dawar [mailto:siddharthdawa...@gmail.com]
> *Sent:* Tuesday, June 07, 2016 4:06 AM
> *To:* user@hadoop.apache.org
> *Subject:* Accessing files in Hadoop 2.7.2 Distributed Cache
>
>
>
> Hi,
>
> I want to use the distributed cache to allow my mappers to access data in
> Hadoop 2.7.2. In main, I'm using the command
>
> String hdfs_path = "hdfs://localhost:9000/bloomfilter";
> InputStream in = new BufferedInputStream(
>         new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
> Configuration conf = new Configuration();
> fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
> OutputStream out = fs.create(new Path(hdfs_path));
>
> // Copy file from local to HDFS
> IOUtils.copyBytes(in, out, 4096, true);
>
> System.out.println(hdfs_path + " copied to HDFS");
>
> DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);
>
>
>
> The above code adds a file present on my local file system to HDFS and adds
> it to the distributed cache.
>
> However, in my mapper code, when I try to access the file stored in the
> distributed cache, the Path[] p variable is null.
>
>
> public void configure(JobConf conf) {
>     this.conf = conf;
>     try {
>         Path[] p = DistributedCache.getLocalCacheFiles(conf);
>     } catch (IOException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
> }
>
> Even when I tried to access the distributed cache from the following code
> in my mapper, I get an error saying the bloomfilter file doesn't exist:
>
> strm = new DataInputStream(new FileInputStream("bloomfilter"));
>
> // Read into our Bloom filter.
>
> filter.readFields(strm);
>
> strm.close();
>
> However, I read somewhere that if we add a file to the distributed cache,
> we can access it directly by its name.
>
> Can you please help me out ?
>
>
>


How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2

2016-06-07 Thread Siddharth Dawar
Hi,

I wrote a program which creates Map-Reduce jobs in an iterative fashion as
follows:


while (true) {

    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);
    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setOutputValueClass(BytesWritable.class);
    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);
    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));

    RunningJob job = JobClient.runJob(conf2);
}


Now, I want the first job that gets created to write something to the
distributed cache, and the jobs created after it to read from the
distributed cache.

I came to know that the DistributedCache.addCacheFile() method is
deprecated, and the documentation suggests using the Job.addCacheFile() method
specific to each job.

But I am unable to get a handle on the currently running job, because
JobClient.runJob(conf2) submits the job internally.


How can I make the content written by the first job in this while loop
available, via the distributed cache, to the jobs created in later
iterations of the loop?
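
Conceptually, what I am trying to end up with is something like the sketch
below (using the new Job API; names are just placeholders) -- the part I don't
know is how to do the equivalent with JobClient.runJob(conf2), which submits
the job internally:

Path shared = new Path(output, "part-00000");   // a file written by the first job

while (moreIterationsNeeded) {

    Job job = Job.getInstance(getConf(), "sid");
    // ... mapper/reducer and input/output setup as above ...

    // make the first job's output readable by this job's tasks
    job.addCacheFile(shared.toUri());

    job.waitForCompletion(true);
}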


Accessing files in Hadoop 2.7.2 Distributed Cache

2016-06-07 Thread Siddharth Dawar
Hi,

I want to use the distributed cache to allow my mappers to access data in
Hadoop 2.7.2. In main, I'm using the command

String hdfs_path = "hdfs://localhost:9000/bloomfilter";
InputStream in = new BufferedInputStream(
        new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
Configuration conf = new Configuration();
fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
OutputStream out = fs.create(new Path(hdfs_path));

// Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);

System.out.println(hdfs_path + " copied to HDFS");

DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);


The above code adds a file present on my local file system to HDFS and
adds it to the distributed cache.

However, in my mapper code, when I try to access the file stored in the
distributed cache, the Path[] p variable is null.


public void configure(JobConf conf) {
    this.conf = conf;
    try {
        Path[] p = DistributedCache.getLocalCacheFiles(conf);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Even when I tried to access the distributed cache from the following code
in my mapper, I get an error saying the bloomfilter file doesn't exist:

strm = new DataInputStream(new FileInputStream("bloomfilter"));
// Read into our Bloom filter.
filter.readFields(strm);
strm.close();

However, I read somewhere that if we add a file to the distributed cache,
we can access it directly by its name.
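
From what I understand (I may be wrong), that only works if the file is
registered on the same JobConf that actually gets submitted (otherwise
getLocalCacheFiles() in configure() comes back null), with a "#name" fragment
so a symlink is created in the task's working directory -- roughly:

// driver side, on the conf2 that is passed to JobClient.runJob(conf2)
DistributedCache.addCacheFile(new URI(hdfs_path + "#bloomfilter"), conf2);
DistributedCache.createSymlink(conf2);

// mapper side
public void configure(JobConf conf) {
    try {
        DataInputStream in = new DataInputStream(new FileInputStream("bloomfilter"));
        filter.readFields(in);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}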

Can you please help me out ?