HDFS metrics
I am using Yarn, and 1 - I want to know the average IO throughput of HDFS (i.e., how fast the datanodes are writing to disk) so that I can compare two HDFS instances. The command hdfs dfsadmin -report doesn't give me that. Does HDFS have a command for that? 2 - Is there a similar way to know how fast data is being transferred between maps and reduces? -- Best regards,
Re: HDFS metrics
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/ On 12/06/2013 09:49, Pedro Sá da Costa wrote: I am using Yarn, and 1 - I want to know the average IO throughput of HDFS (i.e., how fast the datanodes are writing to disk) so that I can compare two HDFS instances. The command hdfs dfsadmin -report doesn't give me that. Does HDFS have a command for that? 2 - Is there a similar way to know how fast data is being transferred between maps and reduces? -- Best regards, -- Thanks and Regards, Bhasker Allene
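For a concrete starting point from that article: TestDFSIO measures aggregate read/write throughput across the datanodes and is an easy way to compare two HDFS instances. A hedged example invocation (the test jar's name and location differ by release: hadoop-test-*.jar in 1.x, hadoop-mapreduce-client-jobclient-*-tests.jar under share/hadoop/mapreduce in 2.x):

hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Each run prints "Throughput mb/sec" and "Average IO rate mb/sec" lines that can be compared across clusters. There is no equivalent built-in report for map-to-reduce transfer speed; the per-job shuffle counters (e.g. "Reduce shuffle bytes") and task timings are the closest measure.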
Get the history info in Yarn
I tried the command mapred job -list all to get the history of completed jobs, but the listing doesn't show the time when a job started and ended, the number of maps and reduces, or the size of data read and written. Can I get this info with a shell command? I am using Yarn. -- Best regards,
RE: Get the history info in Yarn
Hi, You can get all the details for a job using this mapred command: mapred job -status <job-id>. For this you need to have the Job History Server running, and the same job history server address configured on the client side. Thanks and Regards, Devaraj K From: Pedro Sá da Costa [mailto:psdc1...@gmail.com] Sent: Thursday, June 13, 2013 10:52 AM To: mapreduce-user Subject: Get the history info in Yarn I tried the command mapred job -list all to get the history of completed jobs, but the listing doesn't show the time when a job started and ended, the number of maps and reduces, or the size of data read and written. Can I get this info with a shell command? I am using Yarn. -- Best regards,
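As a concrete illustration (the job ID below is made up), the usual sequence is to list the retired jobs and then ask for the status of one of them:

mapred job -list all
mapred job -status job_1370928000000_0042

The status output includes the number of maps and reduces, completion percentages, the job state, and the full counter set (HDFS and file bytes read/written). Start and finish times are easiest to read off the Job History Server web UI; depending on the release, mapred job -history can also print them, though its argument (job output directory versus history file or job ID) differs between versions.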
Task Tracker going down on hive cluster
In the last 4-5 days the TaskTracker on one of my slave machines has gone down a couple of times. It had been working fine for the past 4-5 months. The cluster configuration is a 4-machine cluster on AWS: 1 m2.xlarge master, 3 m2.xlarge slaves. The cluster is dedicated to running Hive queries, with the data residing on S3. The slave on which the TaskTracker went down had the following log: ***
2013-06-11 00:26:30,968 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 279198
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.191.**.***:37605, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 193135
2013-06-11 00:26:30,971 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60630, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 192011
2013-06-11 00:26:30,972 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 178209
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 186452
2013-06-11 00:26:30,973 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 157360
2013-06-11 00:26:30,974 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.8.***.**:45321, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 157555
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201306071409_0151_m_-435659475 but just removed
2013-06-11 00:26:30,991 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201306071409_0151_m_-435659475 exited with exit code 0. Number of tasks it ran: 0
2013-06-11 00:26:30,991 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
org.apache.hadoop.fs.FSError: java.io.IOException: Broken pipe
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:200)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:220)
	at sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:315)
	at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:148)
	at java.io.OutputStreamWriter.close(OutputStreamWriter.java:233)
	at java.io.BufferedWriter.close(BufferedWriter.java:265)
	at java.io.PrintWriter.close(PrintWriter.java:312)
	at org.apache.hadoop.mapred.TaskController.writeCommand(TaskController.java:231)
	at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:126)
	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:497)
	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:471)
Caused by: java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:297)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:198)
	... 13 more
2013-06-11 00:26:31,007 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201306071409_0151_m_-495709221
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60656, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005694_0, duration: 222430
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60653, bytes: 38, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005693_0, duration: 154027
2013-06-11 00:26:31,008 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.191.**.***:50060, dest: 10.190.***.***:60659, bytes: 6, op: MAPRED_SHUFFLE, cliID: attempt_201306071409_0151_m_005700_0, duration: 132067
2013-06-11 00:26:31,326 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201306071409_0151_m_-495709221 spawned.
2013-06-11 00:26:31,328 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /mnt/app/hadoop-tmp/ttprivate/taskTracker/piyushv/jobcache/job_201306071409_0151/attempt_201306071409_0151_m_005717_0/taskjvm.sh
2013-06-11 00:26:31,331 INFO
Re: Container allocation on the same node
Hi Harsh, What will happen when I specify the local host as the required host? Doesn't the resource manager give me all the containers on the local host? I don't want to constrain myself to the local host, which might be busy while other nodes in the cluster have enough resources available for me. Thanks, Kishore On Wed, Jun 12, 2013 at 6:45 PM, Harsh J ha...@cloudera.com wrote: You can request containers with the local host name as the required host, and perhaps reject and re-request if they aren't designated to be on that one until you have sufficient. This may take a while though. On Wed, Jun 12, 2013 at 6:25 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I want to get some containers for my application on the same node; is there a way to make such a request? For example, I have an application which needs 10 containers, but with a constraint that a set of those containers needs to be running on the same node. Can I ask the resource manager to give me, let us say, 5 containers on the same node? I know that there is now a way to specify the node name on which I need a container, but I don't care which node in the cluster they get allocated on, I just need them on the same node. Please suggest whether it is possible, and how I can do that. Thanks, Kishore -- Harsh J
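A hedged illustration of Harsh's suggestion, not tested code: asking the ResourceManager for several containers with a single node in each request's node list via AMRMClient. The class and method names are from the 2.x AMRMClient API, but the exact ContainerRequest constructor signature varies across 2.x releases, so verify against your version; the memory size and the idea of releasing and re-requesting off-node allocations are assumptions for the sketch.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

public class SameNodeRequests {
    // Ask for numContainers containers, each with nodeName as the requested host.
    public static void requestOnNode(AMRMClient<ContainerRequest> amClient,
                                     String nodeName, int numContainers) {
        Resource capability = Records.newRecord(Resource.class);
        capability.setMemory(1024); // 1 GB per container, adjust as needed
        Priority priority = Records.newRecord(Priority.class);
        priority.setPriority(0);
        for (int i = 0; i < numContainers; i++) {
            // Containers that come back on a different host can be released
            // and re-requested, as Harsh describes.
            ContainerRequest req = new ContainerRequest(
                    capability, new String[] { nodeName }, null /* racks */, priority);
            amClient.addContainerRequest(req);
        }
    }
}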
Re: Management API
Rita, There aren't any specs as far as I know, but in my experience the interface is stable enough from version to version, with the occasional extra field added here or there. If you query specifically for the beans you want (e.g. http://namenode:50070/jmx?get=Hadoop:service=NameNode,name=NameNodeInfo::LiveNodes ) and build in some flexibility, you shouldn't have any problems. Regards, Marcos On 09-06-2013 11:30, Rita wrote: Are there any specs for the JSON schema? On Thu, Jun 6, 2013 at 9:49 AM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Brian, If you have access to the web UI, you can get those metrics in JSON from the JMXJsonServlet. Try hitting http://namenode_hostname:50070/jmx?qry=Hadoop:* and http://jobtracker_v1_hostname:50030/jmx?qry=hadoop:* It isn't as extensive as other options, but if you just need a snapshot of node capacity and utilization, it's pretty handy. I used it to plug some basic warnings into Nagios. Regards, Marcos On 06-06-2013 09:51, Brian Mason wrote: I am looking for a way to access a list of nodes (compute, data, etc.). My application is not running on the name node; it is remote. The 2.0 Yarn APIs look like they may be useful, but I am not on 2.0 and cannot move to 2.0 anytime soon. DFSClient.java looks useful, but it's not in the API docs, so I am not sure how to use it or even if I should. Any pointers would be helpful. Thanks, -- --- Get your facts first, then you can distort them as you please.--
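A small hedged example of consuming that servlet from a remote Java client, along the lines Marcos describes. The host, port, and bean name below are placeholders to adapt to your cluster, and the sketch just dumps the raw JSON rather than parsing it:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class JmxProbe {
    public static void main(String[] args) throws Exception {
        // Query one specific bean rather than the full Hadoop:* dump.
        URL url = new URL("http://namenode:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=NameNodeInfo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON; feed it to any JSON parser
            }
        } finally {
            conn.disconnect();
        }
    }
}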
Re: Now give .gz file as input to the MAP
Rahul-da, I found bz2 pretty slow (although splittable), so I switched to Snappy (only sequence files are splittable, but compress-decompress is fast). Thanks Sanjay From: Rahul Bhattacharjee rahul.rec@gmail.com Reply-To: user@hadoop.apache.org Date: Tuesday, June 11, 2013 9:53 PM To: user@hadoop.apache.org Subject: Re: Now give .gz file as input to the MAP Nothing special is required for processing .gz files using MR. However, as Sanjay mentioned, verify the codecs configured in core-site; another thing to note is that these files are not splittable. You might want to use bz2, those are splittable. Thanks, Rahul On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote:

hadoopConf.set("mapreduce.job.inputformat.class", "com.wizecommerce.utils.mapred.TextInputFormat");
hadoopConf.set("mapreduce.job.outputformat.class", "com.wizecommerce.utils.mapred.TextOutputFormat");

No special settings are required for reading Gzip except these above. If you want to output Gzip:

hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");

Make sure the Gzip codec is defined in core-site.xml:

<!-- core-site.xml -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

I have a question: why are you using GZIP as input to Map? These files are not splittable, unless you have to read multiple lines (like lines between a BEGIN and END block in a log file) and send them as one record to the mapper. Also, for non-splittable input the Snappy codec is better. Good Luck sanjay From: samir das mohapatra samir.help...@gmail.com Reply-To: user@hadoop.apache.org Date: Tuesday, June 11, 2013 9:07 PM To: cdh-u...@cloudera.com, user@hadoop.apache.org, user-h...@hadoop.apache.org Subject: Now give .gz file as input to the MAP Hi All, Has anyone worked on how to pass a .gz file as input for a mapreduce job? Regards, samir.
Re: Now give .gz file as input to the MAP
Yeah, I too found that quite slow and memory hungry! Thanks, Rahul-da On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian sanjay.subraman...@wizecommerce.com wrote: Rahul-da, I found bz2 pretty slow (although splittable), so I switched to Snappy (only sequence files are splittable, but compress-decompress is fast). Thanks Sanjay [remainder of the quoted message snipped; identical to the message above]
RE: Shuffle design: optimization tradeoffs
In reading this link as well as the Sailfish report, it strikes me that Hadoop skipped a potentially significant optimization. Namely, why are multiple sorted spill files merged into a single output file? Why not have the auxiliary service merge on the fly, thus avoiding landing them to disk? Was this considered and rejected due to placing memory/CPU requirements on the auxiliary service? I am assuming that whether the merge was done on disk or in a stream, it would require decompression/recompression of the data. John -Original Message- From: Albert Chu [mailto:ch...@llnl.gov] Sent: Tuesday, June 11, 2013 3:32 PM To: user@hadoop.apache.org Subject: Re: Shuffle design: optimization tradeoffs On Tue, 2013-06-11 at 16:00 +, John Lilley wrote: I am curious about the tradeoffs that drove the design of the partition/sort/shuffle (Elephant book p. 208). Doubtless this has been tuned and measured and retuned, but I'd like to know what observations came about during the iterative optimization process to drive the final design. For example:
· Why does the mapper output create a single ordered file containing all partitions, as opposed to a file per group of partitions (which would seem to lend itself better to multi-core scaling), or even a file per partition? I researched this awhile back wondering the same thing, and found this JIRA: https://issues.apache.org/jira/browse/HADOOP-331 Al
· Why does the max number of streams to merge at once (io.sort.factor) default to 10? Is this obsolete? In my experience, so long as you have memory to buffer each input at 1MB or so, the merger is more efficient as a single phase.
· Why does the mapper do a final merge of the spill files to disk, instead of having the auxiliary process (in YARN) merge and stream data on the fly?
· Why do mappers sort the tuples, as opposed to only partitioning them and letting the reducers do the sorting?
Sorry if this is overly academic, but I'm sure a lot of people put a lot of time into the tuning effort, and I hope they left a record of their efforts. Thanks John -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory
Aggregating data nested into JSON documents
Hello, I'm new to Hadoop. I have a large quantity of JSON documents with a structure similar to what is shown below.

{
  "g" : "some-group-identifier",
  "sg" : "some-subgroup-identifier",
  "j" : "some-job-identifier",
  "page" : 23,
  ... // other fields omitted
  "important-data" : [
    { "f1" : "abc", "f2" : "a", "f3" : "/", ... },
    ...
    { "f1" : "xyz", "f2" : "q", "f3" : "/", ... }
  ],
  ... // other fields omitted
  "other-important-data" : [
    { "x1" : "ford", "x2" : "green", "x3" : 35,
      "map" : { "free-field" : "value", "other-free-field" : "value2" } },
    ...
    { "x1" : "vw", "x2" : "red", "x3" : 54, ... }
  ]
}

Each file contains a single JSON document (gzip compressed, roughly 200KB uncompressed of pretty-printed JSON text per document). I am interested in analyzing only the important-data array and the other-important-data array. My source data would ideally be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column map would be a complex column; all others would be primitives. ( g, sg, j, page, f1, f2, f3 ) ( g, sg, j, page, x1, x2, x3, map ) So, for each JSON document, I would like to create several rows, but I would like to avoid the intermediate step of persisting - and duplicating - the flattened data. To avoid persisting the data flattened, I thought I had to write my own map-reduce in Java code, but discovered that others have had the same problem of using JSON as the source, and there are somewhat standard solutions. From reading about the SerDe approach for Hive, I get the impression that each JSON document is transformed into a single row of the table, with some columns being an array or a map of other nested structures. a) Is there a way to break each JSON document into several rows for a Hive external table? b) It seems there are too many JSON SerDe libraries! Is any of them considered the de facto standard? The Pig approach also seems promising, using Elephant Bird. Does anybody have pointers to more user documentation on this project? Or is browsing through the examples on GitHub my only source? Thanks
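One way to avoid persisting a flattened copy, offered here as a hedged sketch rather than a recommendation: a map-only job that parses each document and emits one row per element of the important-data array. The class names, the use of the Jackson library, and the assumption of a SequenceFile/whole-file style input that hands one complete document to each map() call are all assumptions, not something from the thread; the second table (g, sg, j, page, x1, x2, x3, map) would follow the same pattern over other-important-data.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonFlattenMapper extends Mapper<Text, BytesWritable, Text, NullWritable> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    protected void map(Text fileName, BytesWritable doc, Context context)
            throws IOException, InterruptedException {
        // Parse the whole document; BytesWritable buffers are padded, so respect getLength().
        JsonNode root = mapper.readTree(
                new ByteArrayInputStream(doc.getBytes(), 0, doc.getLength()));
        String prefix = root.path("g").asText() + "\t" + root.path("sg").asText()
                + "\t" + root.path("j").asText() + "\t" + root.path("page").asText();
        // Emit one tab-separated row per element of the "important-data" array.
        for (JsonNode item : root.path("important-data")) {
            context.write(new Text(prefix + "\t" + item.path("f1").asText()
                    + "\t" + item.path("f2").asText()
                    + "\t" + item.path("f3").asText()), NullWritable.get());
        }
    }
}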
Install CDH4 using tar ball with MRv1, Not YARN version
Hi folks, I am trying to install CDH4 using tarballs with MRv1, not the YARN version (MRv2). I downloaded two tarballs (mr1-0.20.2+n and hadoop-2.0.0+n) from this location http://archive.cloudera.com/cdh4/cdh/4/ as per the Cloudera instruction I found: "If you install CDH4 from a tarball, you will install YARN. To install MRv1 as well, install the separate MRv1 tarball (mr1-0.20.2+n) alongside the YARN one (hadoop-2.0.0+n)." (at the bottom of http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_2.html ). But I could not find steps to install using these two tarballs, since Cloudera tailored the steps to a package installation. I am totally confused about whether to start dfs from the hadoop-2.0.0+n version and start mapred from mr1-0.20.2+n, or something else. Kindly help me with setting this up. Thanks Selva
recovery accidently deleted pig script
Hi everyone, We have a pig script scheduled to run every 4 hours. Someone accidentally deleted the pig script (rm). Is there any way to recover the script? I am guessing Hadoop copies the program to every node before running, so perhaps there is still a copy on one of the nodes. Best regards, Feng Jiang
Re: recovery accidently deleted pig script
Where was the pig script? On HDFS? How often does your cluster clean up the trash? (Deleted stuff doesn't get cleaned up when the file is deleted...) It's a configurable setting, so YMMV. On Jun 12, 2013, at 8:58 PM, feng jiang jiangfut...@gmail.com wrote: Hi everyone, We have a pig script scheduled to run every 4 hours. Someone accidentally deleted the pig script (rm). Is there any way to recover the script? I am guessing Hadoop copies the program to every node before running, so perhaps there is still a copy on one of the nodes. Best regards, Feng Jiang
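As a general note on the trash mechanism, assuming the script lived on HDFS and trash is enabled (fs.trash.interval greater than 0 in core-site.xml): files removed with hadoop fs -rm are first moved to the deleting user's trash directory and can be copied back until the trash interval expires, for example:

hadoop fs -ls /user/<username>/.Trash
hadoop fs -cp /user/<username>/.Trash/Current/path/to/script.pig /path/to/restore/

The <username> and paths above are placeholders. If trash is disabled, or the script was only on the local filesystem rather than HDFS, this does not apply.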
Re: SSD support in HDFS
I could have sworn there was a thread on this already. (Maybe the HBase list?) Andrew P. kinda nailed it when he talked about the fact that you still have to write the replication(s). If you want improved performance, why not look at the hybrid drives that have a small SSD buffer and a spinning disk? I don't know, but it may be what you're looking for. HTH -Mike On Jun 12, 2013, at 5:18 AM, Lucas Stanley lucas23...@gmail.com wrote: Thanks Chris and Phil. On Tue, Jun 11, 2013 at 1:31 PM, Chris Nauroth cnaur...@hortonworks.com wrote: Hi Lucas, HDFS does not have this capability right now, but there has been some preliminary discussion around adding features to support it. You might want to follow jira issues HDFS-2832 and HDFS-4672 if you'd like to receive notifications about the discussion. https://issues.apache.org/jira/browse/HDFS-2832 https://issues.apache.org/jira/browse/HDFS-4672 Chris Nauroth Hortonworks http://hortonworks.com/ On Mon, Jun 10, 2013 at 6:57 PM, Lucas Stanley lucas23...@gmail.com wrote: Hi, Is it possible to tell Apache HDFS to store some files on SSD and the rest of the files on spinning disks? So if each of my nodes has 1 SSD and 5 spinning disks, can I configure a directory in HDFS to put all files in that dir on the SSD? I think Intel's Hadoop distribution is working on some SSD support, right?
Compatibility of Hadoop 0.20.x and hadoop 1.0.3
Hi, all, I was wondering: could an application written with the hadoop 0.20.3 API run on a hadoop 1.0.3 cluster? If not, is there any way to run this application on hadoop 1.0.3 instead of re-writing all the code? -- Lin Yang
Reducer not getting called
Hi, I have a SequenceFile which contains several jpeg images with (image name, image bytes) as key-value pairs. My objective is to count the no. of images by grouping them by the source, something like this:

Nikon Coolpix  100
Sony Cybershot 251
N82            100

The MR code is:

package com.hadoop.basics;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import com.drew.imaging.ImageMetadataReader;
import com.drew.imaging.ImageProcessingException;
import com.drew.metadata.Directory;
import com.drew.metadata.Metadata;
import com.drew.metadata.exif.ExifIFD0Directory;

public class ImageSummary extends Configured implements Tool {

    public static class ImageSourceMapper extends
            Mapper<Text, BytesWritable, Text, IntWritable> {

        private static int tagId = 272;
        private static final IntWritable one = new IntWritable(1);

        public void map(Text imageName, BytesWritable imageBytes,
                Context context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            System.out.println("In the map method, image is "
                    + imageName.toString());

            byte[] imageInBytes = imageBytes.getBytes();
            ByteArrayInputStream bais = new ByteArrayInputStream(imageInBytes);
            BufferedInputStream bis = new BufferedInputStream(bais);

            Metadata imageMD = null;
            try {
                imageMD = ImageMetadataReader.readMetadata(bis, true);
            } catch (ImageProcessingException e) {
                // TODO Auto-generated catch block
                System.out.println("Got an ImageProcessingException !");
                e.printStackTrace();
            }

            Directory exifIFD0Directory = imageMD
                    .getDirectory(ExifIFD0Directory.class);
            String imageSource = exifIFD0Directory.getString(tagId);

            System.out.println(imageName.toString() + " is taken using "
                    + imageSource);
            context.write(new Text(imageSource), one);

            System.out.println("Returning from the map method");
        }
    }

    public static class ImageSourceReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text imageSource, Iterator<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            System.out.println("In the reduce method");

            int finalCount = 0;
            while (counts.hasNext()) {
                finalCount += counts.next().get();
            }
            context.write(imageSource, new IntWritable(finalCount));

            System.out.println("Returning from the reduce method");
        }
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new ImageSummary(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        // TODO Auto-generated method stub
        System.out.println("In ImageSummary.run(...)");
        Configuration configuration = getConf();
Re: Reducer not getting called
You're not using the recommended @Override annotations, and are hitting a classic programming mistake. Your issue is the same as this earlier discussion: http://search-hadoop.com/m/gqA3rAaVQ7 (and the ones before it). On Thu, Jun 13, 2013 at 9:52 AM, Omkar Joshi omkar.jo...@lntinfotech.com wrote: Hi, I have a SequenceFile which contains several jpeg images with (image name, image bytes) as key-value pairs. My objective is to count the no. of images by grouping them by the source. [quoted code snipped; identical to the original message above]
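For readers hitting the same thing: the mistake Harsh is pointing at is that the new-API Reducer's reduce() takes an Iterable, not an Iterator, so the posted method never overrides the base class and the default identity reduce runs instead. A hedged sketch of the corrected reducer, meant as a drop-in replacement for the ImageSourceReducer nested class in the posted ImageSummary job (same imports as above, plus removing the now-unused java.util.Iterator); with @Override the compiler would have flagged the original signature:

public static class ImageSourceReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text imageSource, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this image source.
        int finalCount = 0;
        for (IntWritable count : counts) {
            finalCount += count.get();
        }
        context.write(imageSource, new IntWritable(finalCount));
    }
}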
RE: Reducer not getting called
Ok, but that link is broken - can you provide a working one? Regards, Omkar Joshi -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Thursday, June 13, 2013 11:01 AM To: user@hadoop.apache.org Subject: Re: Reducer not getting called You're not using the recommended @Override annotations, and are hitting a classic programming mistake. Your issue is the same as this earlier discussion: http://search-hadoop.com/m/gqA3rAaVQ7 (and the ones before it). On Thu, Jun 13, 2013 at 9:52 AM, Omkar Joshi omkar.jo...@lntinfotech.com wrote: Hi, I have a SequenceFile which contains several jpeg images with (image name, image bytes) as key-value pairs. My objective is to count the no. of images by grouping them by the source. [quoted code snipped; identical to the original message above]
Re: Compatibility of Hadoop 0.20.x and hadoop 1.0.3
Hi, Vinod, Thanks. 2013/6/13 Vinod Kumar Vavilapalli vino...@hortonworks.com It should mostly work. I just checked our CHANGES.txt file and haven't seen many incompatibilities introduced between those releases. But 0.20.3 is pretty old, so there is only one way to know for sure: compile and run against 1.x. If you are making that jump, you may as well use the latest releases in the 1.x line. Thanks, +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Jun 12, 2013, at 8:34 PM, Lin Yang wrote: Hi, all, I was wondering: could an application written with the hadoop 0.20.3 API run on a hadoop 1.0.3 cluster? If not, is there any way to run this application on hadoop 1.0.3 instead of re-writing all the code? -- Lin Yang -- Lin Yang