Counters across all jobs
Hi, I have around 4 jobs running in a controller. How can I have a single unique counter present in all the jobs and incremented wherever it is used in a job? For example, consider a counter ACount. If job1 increments the counter by 2, job3 by 5, and job4 by 6, can I have the counter displayed in the JobTracker output as job1:2 job2:2 job3:7 job4:13? Thanks, Subbu
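A note on mechanics: MapReduce counters are scoped to a single job, so a cumulative count like the one above has to be maintained by the controller itself, by reading each finished job's counters and carrying the running total forward. A minimal driver-side sketch (the enum, method, and the assumption that the four jobs run sequentially are all hypothetical, not from this thread):

    import org.apache.hadoop.mapreduce.Job;

    public class CounterController {

      // Hypothetical counter incremented inside tasks via
      // context.getCounter(SharedCounters.ACOUNT).increment(n);
      public enum SharedCounters { ACOUNT }

      // jobs holds the four already-configured Job instances (job1..job4).
      static void runAndReport(Job[] jobs) throws Exception {
        long runningTotal = 0;
        for (Job job : jobs) {
          job.waitForCompletion(true);  // run the jobs one after another
          runningTotal += job.getCounters()
              .findCounter(SharedCounters.ACOUNT).getValue();
          System.out.println(job.getJobName() + ": " + runningTotal);
        }
      }
    }

With increments of 2, 0, 5, and 6, this prints job1:2, job2:2, job3:7, job4:13. The JobTracker UI itself will still show only each job's own increment, not the running total.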
Hadoop or HBase
Hi, I want to use a DFS for a Content-Management-System (CMS); in it I just want to store and retrieve files. Please suggest what I should use: Hadoop or HBase? Thanks & Regards, Kushal Agrawal kushalagra...@teledna.com One Earth. Your moment. Go green...
Re: Hadoop or HBase
Typically, CMSs require an RDBMS, which neither Hadoop nor HBase is. Which CMS do you plan to use, and what's wrong with MySQL or other open source RDBMSs? Kai -- Kai Voigt k...@123.org
RE: Hadoop or HBase
As the data is very large (tens of terabytes), it's difficult to take backups: each backup takes 1.5 days. If we use a distributed file system instead, we would not need to do that. Thanks & Regards, Kushal Agrawal kushalagra...@teledna.com
Re: Hadoop or HBase
Having a distributed filesystem doesn't save you from having backups. If someone deletes a file in HDFS, it's gone. What backend storage is supported by your CMS? Kai -- Kai Voigt k...@123.org
heap size for the job tracker
1. Can I change the heap size for the job tracker only, if I am using version 1.0.2?
2. If so, would you please say what exact line I should put in hadoop-env.sh and where? Should I set the value with a plain number or use the -Xmx notation? I mean, which one is the correct way: export HADOOP_HEAPSIZE=2000 or export HADOOP_HEAPSIZE=-Xmx2000m?
3. Do I need to restart the job tracker node or call start-mapred.sh to make the heap size change take effect? Is there anything else I need to do to make the change apply?
Re: heap size for the job tracker
Hi Mike,

On Wed, Aug 29, 2012 at 7:40 AM, Mike S mikesam...@gmail.com wrote:

> 1. Can I change the heap size for the job tracker only, if I am using version 1.0.2?

Yes.

> 2. If so, would you please say what exact line I should put in hadoop-env.sh and where? Should I set the value with a plain number or use the -Xmx notation?

The first one is the right syntax, but HADOOP_HEAPSIZE changes the heap across _all_ daemons, not just the JT specifically, so you don't want to do that. You may instead find and change the line below in hadoop-env.sh:

export HADOOP_JOBTRACKER_OPTS="$HADOOP_JOBTRACKER_OPTS -Xmx2g"

> 3. Do I need to restart the job tracker node or call start-mapred.sh to make the heap size change take effect? Is there anything else I need to do to make the change apply?

You will need to restart the JobTracker JVM for the new heap limit to get used. You can run hadoop-daemon.sh stop jobtracker followed by hadoop-daemon.sh start jobtracker to restart just the JobTracker daemon (run the commands on the JT node).

-- Harsh J
Suggestions/Info required regarding Hadoop Benchmarking
Hi Users,

I have a 12 node CDH3 cluster where I am planning to run some benchmark tests. My main intention is to run the benchmarks first with the default Hadoop configuration, then analyze the outcomes and tune the Hadoop settings accordingly to increase the performance of my cluster.

Can someone suggest which important Hadoop metrics I should observe during benchmarking? Also, I have seen somewhere that the ratio of Avg Map Task to Avg Reduce Task execution time is recorded for various benchmarks. How significant is that information for judging cluster performance? How will the ratios help me analyze and tune the Hadoop cluster for better performance?

Till now I have run the following benchmarks without tuning the cluster (with the default Hadoop configuration):
- Sort
- WordCount
- TeraSort
- TestDFSIO

Please suggest which other benchmarks I should run, especially from hadoop-test.jar in the $HADOOP_HOME directory, and what those jobs are used for.

Thanks,
Gaurav Dasgupta
example usage of s3 file system
Hi, I am trying to use the Hadoop filesystem abstraction with S3, but in my tinkering I am not having a great deal of success. I am particularly interested in the ability to mimic a directory structure (since S3 native doesn't do it). Can anyone point me to some good example usage of Hadoop FileSystem with S3? I created a few directories using Transmit and the AWS S3 console for a test. Doing a listStatus of the bucket returns a FileStatus object of the directory created, but if I try to do a listStatus of that path I am getting a 404: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error. HEAD '/' on Host Probably not the best list to look for help; any clues appreciated. C
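For what it's worth, a minimal sketch of driving the s3n filesystem through the FileSystem API (bucket name and credentials are placeholders). Note that s3n emulates directories with its own marker objects, so "directories" created by other tools such as the AWS console may not be visible to it, which can produce errors like the 404 above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ListExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder
        Path bucket = new Path("s3n://my-bucket/");
        FileSystem fs = bucket.getFileSystem(conf);

        // Create a "directory"; s3n writes its own directory marker object.
        fs.mkdirs(new Path(bucket, "logs/2012/08"));

        // List what s3n can see under the prefix.
        for (FileStatus st : fs.listStatus(new Path(bucket, "logs"))) {
          System.out.println(st.getPath() + (st.isDir() ? " (dir)" : ""));
        }
        fs.close();
      }
    }

Directories created this way should then list cleanly with listStatus.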
Re: error in shuffle in InMemoryMerger
Hi Abhay,

The error line - Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_128.out - suggests you either do not have permissions on the output folder or the disk is full. Also, 5 is not a big number of threads to spawn (in fact, it is the default for parallel copies), so I would not recommend reducing it, though a lower value might work. The long-term indication is that your system needs to undergo node maintenance.

Thanks
Rekha

From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Tue, 28 Aug 2012 14:52:27 +0530
To: user@hadoop.apache.org
Subject: error in shuffle in InMemoryMerger

Hello,

I am getting the following error when the reduce task is running. The mapreduce.reduce.shuffle.parallelcopies property is set to 5.

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(AccessController.java:284)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_128.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(AccessController.java:284)
    at javax.security.auth.Subject.doAs(Subject.java:573)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:773)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_119.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:351)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:132)
    at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputF

Regards,
Abhay
hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Hi, I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good for writing results into (for example) different directories created on the fly. However, now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see that the new API no longer supports MultipleTextOutputFormat. Is there an equivalent that I can use, or will it be supported in a future release? Thanks, Tony
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
The Multiple*OutputFormat classes have been deprecated in favor of the generic MultipleOutputs API. Would using that instead work for you? -- Harsh J
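For illustration, a reducer-side sketch of the new-API MultipleOutputs writing to directories derived from the key (class and field names are made up; this assumes your 1.0.3 build ships org.apache.hadoop.mapreduce.lib.output.MultipleOutputs — the old-API org.apache.hadoop.mapred.lib.MultipleOutputs is similar):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class CategoryReducer extends Reducer<Text, Text, NullWritable, Text> {
      private MultipleOutputs<NullWritable, Text> mos;

      @Override
      protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          // A baseOutputPath containing '/' creates subdirectories on the fly,
          // e.g. <job output dir>/<key>/part-r-00000.
          mos.write(NullWritable.get(), value, key.toString() + "/part");
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();  // required, or buffered output may be lost
      }
    }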
Re: error in shuffle in InMemoryMerger
I checked the mapred.tmp.local directory on the node which is running the reducer attempt, and it seems there is around 1 GB of space available (though that is low).
one reducer is hanged in reduce- copy phase
Hello,

I have an MR job which has 4 reducers running. One of the reduce attempts has been pending for a long time in the reduce-copy phase. The job is not able to complete because of this. I have seen that the child java process on the tasktracker is running.

Is it possible to run the same attempt again? Does killing the child java process or the tasktracker on the node help? (Hadoop may then schedule the reduce attempt on another node.) Can I copy the map intermediate output required for this single reducer (which is hung) and rerun only the hung reducer?

Thank you in advance.
~Abhay

task_201208250623_0005_r_00 (http://dpep089.innovate.ibm.com:50030/taskdetails.jsp?tipid=task_201208250623_0005_r_00): 26.41%, reduce > copy (103 of 130 at 0.08 MB/s), 28-Aug-2012 03:09:34
Re: best way to join?
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan dextermorga...@gmail.com wrote:

> I understand your solution (I think); didn't think of that in that particular way. I think that, let's say I have 1M data-points and am running knn, then k=1M and n=10 (each point is a cluster that requires up to 10 points) is overkill.

I am not sure I understand you. n = number of points. k = number of clusters. For searching 1 million points, I would recommend thousands of clusters.

> How can I achieve the same result WITHOUT using mahout, just running on the dataset? I even think it'll be of the same complexity (O(n^2)).

Running with a good knn package will give you roughly O(n log n) complexity.
Re: Hadoop or HBase
Regards to all the list. Well, you should ask the Tumblr folks: they use a combination of MySQL and HBase for their blogging platform. They talked about this topic at the last HBaseCon. Here is the link: http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/ Blake Matheny, Director of Platform Engineering at Tumblr, was the presenter of this topic. Best wishes
Re: distcp error.
Hi, Tao. Does this problem occur only with 2.0.1, or with both versions? Have you tried to use distcp from 1.0.3 to 1.0.3?

On 28/08/2012 11:36, Tao wrote:

Hi, all. I use distcp to copy data from Hadoop 1.0.3 to Hadoop 2.0.1. When the file path (or file name) contains Chinese characters, an exception is thrown, like below. I need some help with this. Thanks.

[hdfs@host ~]$ hadoop distcp -i -prbugp -m 14 -overwrite -log /tmp/distcp.log hftp://10.xx.xx.aa:50070/tmp/中文路径测试 hdfs://10.xx.xx.bb:54310/tmp/distcp_test14
12/08/28 23:32:31 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=true, maxMaps=14, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hftp://10.xx.xx.aa:50070/tmp/中文路径测试], targetPath=hdfs://10.xx.xx.bb:54310/tmp/distcp_test14}
12/08/28 23:32:33 INFO tools.DistCp: DistCp job log path: /tmp/distcp.log
12/08/28 23:32:34 WARN conf.Configuration: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
12/08/28 23:32:34 WARN conf.Configuration: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
12/08/28 23:32:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/08/28 23:32:36 INFO mapreduce.JobSubmitter: number of splits:1
12/08/28 23:32:36 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
12/08/28 23:32:36 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
12/08/28 23:32:36 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
12/08/28 23:32:36 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
12/08/28 23:32:37 INFO mapred.ResourceMgrDelegate: Submitted application application_1345831938927_0039 to ResourceManager at baby20/10.1.1.40:8040
12/08/28 23:32:37 INFO mapreduce.Job: The url to track the job: http://baby20:8088/proxy/application_1345831938927_0039/
12/08/28 23:32:37 INFO tools.DistCp: DistCp job-id: job_1345831938927_0039
12/08/28 23:32:37 INFO mapreduce.Job: Running job: job_1345831938927_0039
12/08/28 23:32:50 INFO mapreduce.Job: Job job_1345831938927_0039 running in uber mode : false
12/08/28 23:32:50 INFO mapreduce.Job: map 0% reduce 0%
12/08/28 23:33:00 INFO mapreduce.Job: map 100% reduce 0%
12/08/28 23:33:00 INFO mapreduce.Job: Task Id : attempt_1345831938927_0039_m_00_0, Status : FAILED
Error: java.io.IOException: File copy failed: hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 --> hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:262)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:229)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:45)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 to hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017
    at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)
    ... 10 more
Caused by:
Re: How to reduce total shuffle time
Without knowing your exact workload, using a Combiner (if possible) as Tsuyoshi recommended should decrease your total shuffle time. You can also try compressing the map output so that there's less disk and network IO. Here's an example configuration using Snappy:

conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");

HTH,
Minh

On Tue, Aug 28, 2012 at 4:37 AM, Tsuyoshi OZAWA ozawa.tsuyo...@lab.ntt.co.jp wrote:

It depends on the workload. Could you tell us more about your job? In the general case where reducers are the bottleneck, there are some tuning techniques, as follows:
1. Allocate more memory to reducers. This decreases the disk IO of reducers when merging and running reduce functions.
2. Use a combine function, which enables mapper-side aggregation, if your MR job consists of operations that satisfy both the commutative and the associative law. See also this page about combine functions: http://wiki.apache.org/hadoop/HadoopMapReduce

Tsuyoshi

On Tuesday, August 28, 2012, Gaurav Dasgupta wrote:

Hi, I have run some large and small jobs and calculated the Total Shuffle Time for the jobs. I can see that the Total Shuffle Time is almost half the total time taken by the full job to complete. My question here is: how can we decrease the Total Shuffle Time? And in doing so, what will the effect on the job be?

Thanks,
Gaurav Dasgupta
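To make Tsuyoshi's second point concrete, here is a minimal combiner sketch for a sum-style job (names are hypothetical); it is safe because integer addition is commutative and associative:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Pre-aggregates map output so the shuffle moves partial sums
    // instead of every individual record.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

It is enabled in the driver with job.setCombinerClass(SumCombiner.class); when a reducer's output types match its input types, the reducer class itself can often be reused as the combiner.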
datanode has no storageID
Hi, I hope it's not a newbie question... I installed several versions of Hadoop for testing (0.20.203, 0.21.0, and 1.0.3) on various machines. Now I am using 1.0.3 on all the machines, and I face a problem that on some of the machines the datanode gets no storageID from the namenode.

Where it works, the datanode has the following lines in the log file (and current/VERSION has a storageID=<some ID> line):

---
2012-08-28 19:04:31,415 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(datanode-works.cs.tau.ac.il:50010, storageID=DS-996163017-machines-ip-50010-1342683478942, infoPort=50075, ipcPort=50020)
2012-08-28 19:04:31,418 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting asynchronous block report scan
2012-08-28 19:04:31,418 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(machines-ip:50010, storageID=DS-996163017-machines-ip-50010-1342683478942, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/var/cache/hdfs/hadoop-data-node/current'}
2012-08-28 19:04:31,419 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
---

Where it doesn't work I have only the first line and it hangs (and current/VERSION has an empty 'storageID=' line):

++
2012-08-28 18:42:01,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(machinename.cs.tau.ac.il:50010, storageID=, infoPort=50075, ipcPort=50020)
2012-08-28 18:42:01,287 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
++

1. Any ideas?
2. How/where does the namenode store the datanodes' storageIDs?
3. How can I get a new storageID for a datanode, or its old ID?
4. Can I format/reset the namenode to enable the datanode to reconnect?

thanks!
- Boaz Yarom
CS System Team
03-640-8961 / 7637
Hadoop and MainFrame integration
Hi Users.

We have flat files on mainframes with around a billion records. We need to sort them and then use them with different jobs on the mainframe for report generation. I was wondering whether there is any way I could integrate the mainframe with Hadoop, do the sorting, and keep the file on the server itself (I do not want to FTP the file to a Hadoop cluster and then FTP the sorted file back to the mainframe, as that would waste MIPS and nullify the advantage). This way I could save on MIPS and ultimately improve profitability.

Thank you in advance

Cheers !!!
Siddharth Tiwari
Have a refreshing day !!!
“Every duty is holy, and devotion to duty is the highest form of worship of God.”
Maybe other people will try to limit me but I don't limit myself
Re: Hadoop and MainFrame integration
At some point in the workflow you're going to have to transfer the file from the mainframe to the Hadoop cluster for processing, and then send it back for storage on the mainframe. You should be able to automate the process of sending the files back and forth. It's been my experience that it's often faster to process and sort large files on a Hadoop cluster even while factoring in the cost to transfer to/from the mainframe. Hopefully that answers your question. If not, are you looking to actually use Hadoop to process files in place on the mainframe? That concept conflicts with my understanding of Hadoop.
Re: one reducer is hanged in reduce- copy phase
Hi Abhay,

The map outputs are deleted only after the reducer runs to completion.

> Is it possible to run the same attempt again? Does killing the child java process or tasktracker on the node help? (since hadoop may schedule a reduce attempt on another node)

Yes, it is possible to re-attempt the task; for that you need to fail the current attempt.

> Can I copy the map intermediate output required for this single reducer (which is hung) and rerun only the hung reducer?

It is not that easy to accomplish this. Better to fail the task explicitly so that it is re-attempted.

Regards
Bejoy KS

Sent from handheld, please excuse typos.
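For reference, a hedged note on how an attempt is usually failed under MRv1 (the attempt id below is a placeholder, not the truncated one from this thread):

    hadoop job -fail-task <task-attempt-id>   # re-schedules the attempt; counts as a failure
    hadoop job -kill-task <task-attempt-id>   # re-schedules it without counting against the task's allowed failures

The attempt id can be copied from the task details page in the JobTracker web UI.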
Hadoop Streaming question
Hi all,

I am using Python on CDH3u3 for streaming. I do not know how to provide command-line arguments. My Python mapper takes 3 arguments - 2 input files and one placeholder for an output file. I am doing something like the below, but it fails. Where am I going wrong? What other options do I have? Any best practices? I am using -cmdenv, but I do not know how exactly to use it. I have seen this question on the net, but I have not found a working answer.

HDFS_INPUT_1=/user/kk/book/eccfile.txt
HDFS_INPUT_2=/user/kk/book/calist.txt
LOCAL_INPUT_1=$KK_HOME/eccfile.txt
LOCAL_INPUT_2=$KK_HOME/calist.txt
HDFS_OUTPUT=/user/kk/book/eccoutput
LOCAL_OUTPUT=$KK_HOME/

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
 -D mapred.job.name=CM \
 -D mapred.reduce.tasks=0 \
 -files $LOCAL_INPUT_1, $LOCAL_INPUT_2 \
 -input $HDFS_INPUT_1 \
 -output $HDFS_OUTPUT \
 -file $KK_HOME/ec_ca.py \
 -cmdenv arg1=$LOCAL_INPUT_1 \
 -cmdenv arg2=$LOCAL_INPUT_2 \
 -cmdenv arg3=$LOCAL_OUTPUT \
 -mapper $KK_HOME/ec_ca.py $arg1 $arg2 $arg3

Some more related questions:
1. What is the option for sending a file to all the nodes (say, arg2)? This file is a reference input file that is needed for processing. Should I use the -files option, like DistributedCache?
2. I really do not know what happens if I specify an output file (in a local dir). I understand that specifying an HDFS location for output will nicely place the output in that dir. My Python script writes the output into a local directory - which I tested and it worked fine locally. But what really happens when I try to run on Hadoop? This is my $arg3.

Thanks and appreciate your help,
PD.
Re: Hadoop and MainFrame integration
Build a custom transfer mechanism in Java and use a zAAP so you won't consume MIPS.
Re: Hadoop and MainFrame integration
Can you NFS-mount the mainframe filesystem from the Hadoop cluster? Otherwise, do you or your mainframe vendor have a custom Hadoop filesystem binding for the mainframe? If not, you should be able to use ftp:// URLs as the source of data for the initial MR job; at the end of the sequence of MR jobs the result can go back to the mainframe.
RE: unsubscribe
Error: unsubscribe request failed. Please retry again during a full moon.

From: Alberto Andreotti [mailto:albertoandreo...@gmail.com]
Sent: Thursday, August 23, 2012 9:00 AM
To: user@hadoop.apache.org
Subject: unsubscribe

unsubscribe

--
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 54351156526363.
MSN: albertoandreo...@gmail.com
Skype: andreottialberto
Re: best way to join?
I don't mean that. I mean that a k-means clustering with pretty large clusters is a useful auxiliary data structure for finding nearest neighbors. The basic outline is that you find the nearest clusters and search those for near neighbors. The first riff is that you use a clever data structure for finding the nearest clusters so that you can do that faster than linear search. The second riff is when you use another clever data structure to search each cluster quickly. There are fancier data structures available as well.

On Tue, Aug 28, 2012 at 12:04 PM, dexter morgan dextermorga...@gmail.com wrote:

Right, but if I understood your suggestion, you look at the end goal, which is, for example: 1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]] and you say: here we basically see a cluster; that cluster is represented by the point [40.123,-50.432]. Which points does this cluster contain? [[41.431,-43.32],[...,...],...,[...]] Meaning that for every point I have in the dataset, you create a cluster. If you don't mean that, but instead mean to create clusters based on some random-seed points or whatnot, that would mean that I'll have points (talking about the end goal) that won't have enough points in their list. One of the criteria for a clustering is that for any clusters C_i and C_j (where i != j), C_i intersect C_j is empty. And again, how can I accomplish my task without running Mahout / a knn algo? Just by calculating the distance between points? A join of the file with itself? Thanks
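As a toy, single-machine illustration of that outline (names and structure are hypothetical; a real setup would build the clusters with something like Mahout's k-means and use smarter structures for both steps):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ClusteredKnn {
      // Squared Euclidean distance; sufficient for ranking.
      static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
      }

      // centroids[c] is cluster c's center; members.get(c) holds its points.
      static List<double[]> nearest(final double[] q, final double[][] centroids,
                                    List<List<double[]>> members,
                                    int clustersToProbe, int n) {
        // Step 1: rank clusters by centroid distance (cheap compared to all points).
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(c -> dist(q, centroids[c])));

        // Step 2: linear-search only the closest few clusters, keeping the n
        // best points in a max-heap ordered by distance to q.
        PriorityQueue<double[]> best = new PriorityQueue<>(
            Comparator.comparingDouble((double[] p) -> dist(q, p)).reversed());
        for (int i = 0; i < clustersToProbe && i < order.length; i++) {
          for (double[] p : members.get(order[i])) {
            best.add(p);
            if (best.size() > n) best.poll();  // evict the current farthest
          }
        }
        return new ArrayList<double[]>(best);
      }
    }

Probing only a handful of clusters is what turns the O(n^2) all-pairs join into roughly O(n log n) overall, at the cost of occasionally missing a true neighbor that sits near a cluster boundary.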
copy Configuration into another
Is it possible to copy an org.apache.hadoop.conf.Configuration into another Configuration object without creating a new instance? I am looking for something like new Configuration(Configuration), but without creating a new destination object (it is managed by Spring).
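One possible approach (an untested sketch, not an official API for this): Configuration is Iterable over its key/value pairs, so the source's properties can be copied into an existing destination instance:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    public final class ConfUtil {
      private ConfUtil() {}

      // Copies every key/value pair from src into dst, overwriting duplicates.
      // Note this copies only the flattened property view; things like the
      // classloader or registered resources are not carried over.
      public static void copyInto(Configuration src, Configuration dst) {
        for (Map.Entry<String, String> e : src) {
          dst.set(e.getKey(), e.getValue());
        }
      }
    }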
Re: Hadoop and MainFrame integration
Can you read the data off backup tapes and dump it to flat files?

Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: Tuesday, August 28, 2012 06:51 PM
To: user@hadoop.apache.org
Cc: Siddharth Tiwari siddharth.tiw...@live.com
Subject: Re: Hadoop and MainFrame integration

The problem with it is that Hadoop depends on HDFS, which stores data in blocks of 64/128 MB (or the size that you determine; 64 MB is the de-facto default), and then makes the calculations there. So, you need to move all your data to an HDFS cluster if you want to use it in MapReduce jobs and make the calculations with Hadoop.

Best wishes
RE: unsubscribe
HADOOP policy has changed. Any user wanting to unsubscribe needs to donate USD 100/- to Obama's campaign before the request is accepted.

From: Georgi Georgiev [mailto:g.d.georg...@gmail.com]
Sent: 29 August 2012 03:31
To: user@hadoop.apache.org
Cc: Hennig, Ryan
Subject: Re: unsubscribe

I even got emails from people out of office, by sending the email below - that's crazy! g

On Wed, Aug 29, 2012 at 12:56 AM, Georgi Georgiev g.d.georg...@gmail.com wrote: Guys - what's going wrong with these requests - can't you just teach people to act appropriately and send regular mails to unsubscribe - really, a lot of spam in my inbox. cheers, g

On Wed, Aug 29, 2012 at 12:08 AM, Fabio Pitzolu fabio.pitz...@gmail.com wrote: Epic Ryan!!! Sent from my Windows Phone
HBase and MapReduce data locality
I have been reading up on HBase, and my understanding is that the physical files in HDFS are split first by region and then by column family. Thus each column family has its own physical file (on a per-region basis). If I run a MapReduce task that uses HBase as input, wouldn't this imply that if the task reads from more than one column family, the data for that row might not be (entirely) local to the task? Is there a way to tell HDFS to keep the blocks of each region's column families together?
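A tangential, hedged mitigation sketch: it does not co-locate family blocks, but an MR-over-HBase job that needs only one family can restrict its Scan so that only that family's store files are read per region (table, family, and mapper names below are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class ScanSetup {
      // MyMapper is a hypothetical TableMapper defined elsewhere.
      static void configure(Job job) throws IOException {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf1"));  // skip other families' store files
        scan.setCaching(500);                  // batch more rows per RPC for MR scans
        scan.setCacheBlocks(false);            // don't churn the region server block cache
        TableMapReduceUtil.initTableMapperJob(
            "mytable", scan, MyMapper.class,
            ImmutableBytesWritable.class, Result.class, job);
      }
    }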
Re: MRBench Maps strange behaviour
Hi,

The number of maps specified to any map reduce program (including those that are part of MRBench) is generally only a hint, and the actual number of maps will in typical cases be influenced by the amount of data being processed. You can take a look at this wiki link to understand more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

In the examples below, since the data you've generated is different, the number of mappers is different. To be able to judge your benchmark results, you'd need to benchmark against the same data (or at least the same kind of data - i.e., the same size and type). The number of maps printed at the end comes straight from the input specified and doesn't reflect what the job actually ran with. The information from the counters is the right one.

Thanks
Hemanth

On Tue, Aug 28, 2012 at 4:02 PM, Gaurav Dasgupta gdsay...@gmail.com wrote:

Hi All,

I executed the MRBench program from hadoop-test.jar in my 12 node CDH3 cluster. After executing, I had some strange observations regarding the number of Maps it ran.

First I ran the command:
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 3 -maps 200 -reduces 200 -inputLines 1024 -inputType random

And I could see that the actual number of Maps it ran was 201 (for all the 3 runs) instead of 200 (though the end report displays the launched maps to be 200). Here is the console report:

12/08/28 04:34:35 INFO mapred.JobClient: Job complete: job_201208230144_0035
12/08/28 04:34:35 INFO mapred.JobClient: Counters: 28
12/08/28 04:34:35 INFO mapred.JobClient: Job Counters
12/08/28 04:34:35 INFO mapred.JobClient: Launched reduce tasks=200
12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=617209
12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/28 04:34:35 INFO mapred.JobClient: Rack-local map tasks=137
12/08/28 04:34:35 INFO mapred.JobClient: Launched map tasks=201
12/08/28 04:34:35 INFO mapred.JobClient: Data-local map tasks=64
12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1756882

Again, I ran the MRBench for just 10 Maps and 10 Reduces:
hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10 -reduces 10

This time the actual number of Maps was only 2, and again the end report displays Maps Launched to be 10. The console output:

12/08/28 05:05:35 INFO mapred.JobClient: Job complete: job_201208230144_0040
12/08/28 05:05:35 INFO mapred.JobClient: Counters: 27
12/08/28 05:05:35 INFO mapred.JobClient: Job Counters
12/08/28 05:05:35 INFO mapred.JobClient: Launched reduce tasks=20
12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6648
12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/28 05:05:35 INFO mapred.JobClient: Launched map tasks=2
12/08/28 05:05:35 INFO mapred.JobClient: Data-local map tasks=2
12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=163257
12/08/28 05:05:35 INFO mapred.JobClient: FileSystemCounters
12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_READ=407
12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_READ=258
12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1072596
12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3
12/08/28 05:05:35 INFO mapred.JobClient: Map-Reduce Framework
12/08/28 05:05:35 INFO mapred.JobClient: Map input records=1
12/08/28 05:05:35 INFO mapred.JobClient: Reduce shuffle bytes=647
12/08/28 05:05:35 INFO mapred.JobClient: Spilled Records=2
12/08/28 05:05:35 INFO mapred.JobClient: Map output bytes=5
12/08/28 05:05:35 INFO mapred.JobClient: CPU time spent (ms)=17070
12/08/28 05:05:35 INFO mapred.JobClient: Total committed heap usage (bytes)=6218842112
12/08/28 05:05:35 INFO mapred.JobClient: Map input bytes=2
12/08/28 05:05:35 INFO mapred.JobClient: Combine input records=0
12/08/28 05:05:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=254
12/08/28 05:05:35 INFO mapred.JobClient: Reduce input records=1
12/08/28 05:05:35 INFO mapred.JobClient: Reduce input groups=1
12/08/28 05:05:35 INFO mapred.JobClient: Combine output records=0
12/08/28 05:05:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=3348828160
12/08/28 05:05:35 INFO mapred.JobClient: Reduce output records=1
12/08/28 05:05:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=22955810816
12/08/28 05:05:35 INFO mapred.JobClient: Map output records=1

DataLines Maps Reduces AvgTime (milliseconds)
120 20 17451

Can someone please help me understand this behaviour of Hadoop in