Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Hi Arun,

Thanks for your prompt reply. Actually, I want to add cache files to the job that runs internally in the JobClient.runJob(conf2) method, but I am unable to find a way to get hold of that running job. The method Job.getInstance(conf) creates a new job, whereas I want to add files to the currently running job only.

On Tue, Jun 7, 2016 at 6:36 PM, Arun Natva <arun.na...@gmail.com> wrote:
> If you use an instance of the Job class, you can add files to the
> distributed cache like this:
>
> Job job = Job.getInstance(conf);
> job.addCacheFile(fileUri);
>
> Sent from my iPhone
>
> On Jun 7, 2016, at 5:17 AM, Siddharth Dawar <siddharthdawa...@gmail.com> wrote:
>
> Hi,
>
> I wrote a program which creates Map-Reduce jobs in an iterative fashion as
> follows:
>
> while (true) {
>     JobConf conf2 = new JobConf(getConf(), graphMining.class);
>     conf2.setJobName("sid");
>     conf2.setMapperClass(mapperMiner.class);
>     conf2.setReducerClass(reducerMiner.class);
>     conf2.setInputFormat(SequenceFileInputFormat.class);
>     conf2.setOutputFormat(SequenceFileOutputFormat.class);
>     conf2.setOutputValueClass(BytesWritable.class);
>     conf2.setMapOutputKeyClass(Text.class);
>     conf2.setMapOutputValueClass(MapWritable.class);
>     conf2.setOutputKeyClass(Text.class);
>     conf2.setNumMapTasks(Integer.parseInt(args[3]));
>     conf2.setNumReduceTasks(Integer.parseInt(args[4]));
>     FileInputFormat.addInputPath(conf2, new Path(input));
>     FileOutputFormat.setOutputPath(conf2, new Path(output));
>
>     RunningJob job = JobClient.runJob(conf2);
> }
>
> Now, I want the first job which gets created to write something to the
> distributed cache, and the jobs which get created after the first job to
> read from the distributed cache.
>
> I came to know that the DistributedCache.addCacheFile() method is
> deprecated, so the documentation suggests using the per-job
> Job.addCacheFile() method instead.
>
> But I am unable to get a handle on the currently running job, as
> JobClient.runJob(conf2) submits the job internally.
>
> How can I make the content written by the first job in this while loop
> available via the distributed cache to the jobs which get created in
> later iterations of the while loop?
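For what it's worth, here is a minimal sketch of Arun's suggestion applied to the loop above, switching from the old JobConf/JobClient API to the mapreduce Job API so the cache file can be registered before submission. It is a fragment of a Tool.run() method (which declares throws Exception, covering the checked URI exception); the cache URI is a hypothetical placeholder, and waitForCompletion(true) blocks the way JobClient.runJob did:

    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    while (true) {
        Job job = Job.getInstance(getConf(), "sid");
        job.setJarByClass(graphMining.class);
        job.setMapperClass(mapperMiner.class);      // mapper/reducer ported to the
        job.setReducerClass(reducerMiner.class);    // org.apache.hadoop.mapreduce API
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MapWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(Integer.parseInt(args[4]));  // no setNumMapTasks in the new API
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // Registered before submission -- this is the step JobClient.runJob
        // never gave you access to. The URI below is a placeholder.
        job.addCacheFile(new URI("hdfs://localhost:9000/shared/seed.txt#seed"));

        job.waitForCompletion(true);  // blocks until the job finishes
    }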
Re: Accessing files in Hadoop 2.7.2 Distributed Cache
Hi Jeff,

Thanks for your prompt reply. Actually, my problem is as follows: my code creates a new job named "job 1", which writes something to the distributed cache (say a text file), and the job gets completed. Now, I want to create some n number of jobs in the while loop below, which read the text file written by "job 1" from the distributed cache. So my question is: *how to share content among multiple jobs using the distributed cache*?

*Another part of the problem* is that I don't know how to get an instance of the running job from JobClient.runJob(conf2), *so that I can use the job.addCacheFile(...) command*.

while (true) {
    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);
    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setOutputValueClass(BytesWritable.class);
    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);
    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));

    RunningJob job = JobClient.runJob(conf2);
}

On Wed, Jun 8, 2016 at 3:50 AM, Guttadauro, Jeff <jeff.guttada...@here.com> wrote:
> Hi, Siddharth.
>
> I was also a bit frustrated at what I found to be scant documentation on
> how to use the distributed cache in Hadoop 2. The DistributedCache class
> itself was deprecated in Hadoop 2, but there don't appear to be very clear
> instructions on the alternative. I think it's actually much simpler to
> work with files on the distributed cache in Hadoop 2. The new way is to
> add files to the cache (or cache archive) via the Job object:
>
> job.addCacheFile(*uriForYourFile*);
> job.addCacheArchive(*uriForYourArchive*);
>
> The cool part is that, if you set up your URI so that it has a
> "#*yourFileReference*" at the end, then Hadoop will set up a symbolic link
> named "*yourFileReference*" in your job's working directory, which you
> can use to get at the file or archive. So, it's as if the file or archive
> is in the working directory. That obviates the need to even work with the
> DistributedCache class in your Mapper or Reducer, since you can just work
> with the file (or path using nio) directly.
>
> Hope that helps.
> -Jeff
>
> *From:* Siddharth Dawar [mailto:siddharthdawa...@gmail.com]
> *Sent:* Tuesday, June 07, 2016 4:06 AM
> *To:* user@hadoop.apache.org
> *Subject:* Accessing files in Hadoop 2.7.2 Distributed Cache
>
> Hi,
>
> I want to use the distributed cache to allow my mappers to access data in
> Hadoop 2.7.2. In main, I'm using the commands
>
> String hdfs_path = "hdfs://localhost:9000/bloomfilter";
> InputStream in = new BufferedInputStream(
>         new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
> Configuration conf = new Configuration();
> fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
> OutputStream out = fs.create(new Path(hdfs_path));
>
> // Copy file from local to HDFS
> IOUtils.copyBytes(in, out, 4096, true);
> System.out.println(hdfs_path + " copied to HDFS");
>
> DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);
>
> The above code adds a file present on my local file system to HDFS and
> adds it to the distributed cache.
> However, in my mapper code, when I try to access the file stored in the
> distributed cache, the Path[] p variable gets a null value.
>
> public void configure(JobConf conf) {
>     this.conf = conf;
>     try {
>         Path[] p = DistributedCache.getLocalCacheFiles(conf);
>     } catch (IOException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>     }
> }
>
> Even when I tried to access the distributed cache from the following code
> in my mapper, the code returns the error that the bloomfilter file
> doesn't exist:
>
> strm = new DataInputStream(new FileInputStream("bloomfilter"));
> // Read into our Bloom filter.
> filter.readFields(strm);
> strm.close();
>
> However, I read somewhere that if we add a file to the distributed cache,
> we can access it directly by its name.
>
> Can you please help me out?
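A minimal sketch of the mapper-side read Jeff describes, assuming the driver added the file with a URI ending in "#bloomfilter" so the symlink appears in the task's working directory. The key/value types and the use of Hadoop's built-in BloomFilter are assumptions, since the original mapper's types aren't shown:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;

    public class mapperMiner extends Mapper<Text, MapWritable, Text, MapWritable> {
        private final BloomFilter filter = new BloomFilter();

        @Override
        protected void setup(Context context) throws IOException {
            // "bloomfilter" is the symlink Hadoop creates in the task's working
            // directory for a cache URI that ends in "#bloomfilter".
            try (DataInputStream strm =
                     new DataInputStream(new FileInputStream("bloomfilter"))) {
                filter.readFields(strm);
            }
        }

        // map() can then consult `filter` directly.
    }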
How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2
Hi,

I wrote a program which creates Map-Reduce jobs in an iterative fashion as follows:

while (true) {
    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    conf2.setJobName("sid");
    conf2.setMapperClass(mapperMiner.class);
    conf2.setReducerClass(reducerMiner.class);
    conf2.setInputFormat(SequenceFileInputFormat.class);
    conf2.setOutputFormat(SequenceFileOutputFormat.class);
    conf2.setOutputValueClass(BytesWritable.class);
    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(MapWritable.class);
    conf2.setOutputKeyClass(Text.class);
    conf2.setNumMapTasks(Integer.parseInt(args[3]));
    conf2.setNumReduceTasks(Integer.parseInt(args[4]));
    FileInputFormat.addInputPath(conf2, new Path(input));
    FileOutputFormat.setOutputPath(conf2, new Path(output));

    RunningJob job = JobClient.runJob(conf2);
}

Now, I want the first job which gets created to write something to the distributed cache, and the jobs which get created after the first job to read from the distributed cache.

I came to know that the DistributedCache.addCacheFile() method is deprecated, so the documentation suggests using the per-job Job.addCacheFile() method instead.

But I am unable to get a handle on the currently running job, as JobClient.runJob(conf2) submits the job internally.

How can I make the content written by the first job in this while loop available via the distributed cache to the jobs which get created in later iterations of the while loop?
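One way to get this effect (a sketch under assumptions, not a confirmed recipe): have the first job write its result to a well-known HDFS path, then register that file with every later job's distributed cache before submitting it. This is a fragment of a Tool.run() method; the shared path and the part-r-00000 file name below are illustrative:

    import java.net.URI;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Iteration 0: the first job writes to a fixed, well-known location.
    Job seedJob = Job.getInstance(getConf(), "sid-seed");
    // ... same mapper/reducer/format setup as the loop body ...
    FileOutputFormat.setOutputPath(seedJob, new Path("/shared/seed"));
    seedJob.waitForCompletion(true);

    // Later iterations: each new job pulls the first job's output in
    // through its own distributed cache, before submission.
    while (true) {
        Job job = Job.getInstance(getConf(), "sid");
        // ... same setup ...
        // "#seed" makes the file readable as ./seed in every task.
        job.addCacheFile(new URI("/shared/seed/part-r-00000#seed"));
        job.waitForCompletion(true);
    }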
Accessing files in Hadoop 2.7.2 Distributed Cache
Hi,

I want to use the distributed cache to allow my mappers to access data in Hadoop 2.7.2. In main, I'm using the commands

    String hdfs_path = "hdfs://localhost:9000/bloomfilter";
    InputStream in = new BufferedInputStream(
            new FileInputStream("/home/siddharth/Desktop/data/bloom_filter"));
    Configuration conf = new Configuration();
    fs = FileSystem.get(java.net.URI.create(hdfs_path), conf);
    OutputStream out = fs.create(new Path(hdfs_path));

    // Copy file from local to HDFS
    IOUtils.copyBytes(in, out, 4096, true);
    System.out.println(hdfs_path + " copied to HDFS");

    DistributedCache.addCacheFile(new Path(hdfs_path).toUri(), conf2);

The above code adds a file present on my local file system to HDFS and adds it to the distributed cache. However, in my mapper code, when I try to access the file stored in the distributed cache, the Path[] p variable gets a null value.

    public void configure(JobConf conf) {
        this.conf = conf;
        try {
            Path[] p = DistributedCache.getLocalCacheFiles(conf);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

Even when I tried to access the distributed cache from the following code in my mapper, the code returns the error that the bloomfilter file doesn't exist:

    strm = new DataInputStream(new FileInputStream("bloomfilter"));
    // Read into our Bloom filter.
    filter.readFields(strm);
    strm.close();

However, I read somewhere that if we add a file to the distributed cache, we can access it directly by its name.

Can you please help me out?
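A likely culprit, sketched below as an assumption rather than a confirmed diagnosis: the file is added to conf2, but the task-side lookup only sees cache entries registered on the same configuration the job was actually submitted with, before submission. The deprecated calls still exist in 2.7.2 and work when used that way:

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    JobConf conf2 = new JobConf(getConf(), graphMining.class);
    // ... mapper/reducer/format setup ...

    // Register the HDFS file on the SAME JobConf that is submitted, and do
    // it before runJob(); adding it to a different Configuration (or after
    // submission) leaves DistributedCache.getLocalCacheFiles(conf) null.
    DistributedCache.addCacheFile(
            new Path("hdfs://localhost:9000/bloomfilter").toUri(), conf2);

    RunningJob job = JobClient.runJob(conf2);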