Re: Multiple Output Format - Unrecognizable Characters in Output File
Hi James,

Not sure if you meant to write both key and value as text. The line key.write(output); writes the long value in binary format, which might be the reason you saw unrecognizable characters in the output file.

Yaozhen

On Mon, Jul 18, 2011 at 2:00 PM, Teng, James xt...@ebay.com wrote:

Hi,

I encountered a problem when trying to define my own MultipleOutputFormat class; here is the code below.

public class MultipleOutputFormat extends FileOutputFormat<LongWritable, Text> {

    public class LineWriter extends RecordWriter<LongWritable, Text> {
        private DataOutputStream output;
        private byte separatorBytes[];

        public LineWriter(DataOutputStream output, String separator) throws UnsupportedEncodingException {
            this.output = output;
            this.separatorBytes = separator.getBytes("UTF-8");
        }

        @Override
        public synchronized void close(TaskAttemptContext context) throws IOException, InterruptedException {
            // TODO Auto-generated method stub
            output.close();
        }

        @Override
        public void write(LongWritable key, Text value) throws IOException, InterruptedException {
            System.out.println("key:" + key.get());
            System.out.println("value:" + value.toString());
            // TODO Auto-generated method stub
            //output.writeLong(key.)
            //output.write(separatorBytes);
            //output.write(value.toString().getBytes("UTF-8"));
            //output.write("\n".getBytes("UTF-8"));
            //key.write(output);
            key.write(output);
            value.write(output);
            output.write("\n".getBytes("UTF-8"));
        }
    }

    private Path path;

    protected String generateFileNameForKeyValue(LongWritable key, Text value, String name) {
        return key + "" + Math.random();
    }

    @Override
    public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        path = getOutputPath(context);
        System.out.println("d");
        // TODO Auto-generated method stub
        Path file = getDefaultWorkFile(context, "");
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataOutputStream fileOut = fs.create(file, false);
        return new LineWriter(fileOut, "\t");
    }
}

However, there is a problem with unrecognizable characters appearing in the output file. Has anyone encountered this problem before? Any comment is greatly appreciated, thanks in advance.

James, Teng (Teng Linxiao)
eRL, CDC, eBay, Shanghai
Extension: 86-21-28913530
MSN: tenglinx...@hotmail.com
Skype: James,Teng
Email: xt...@ebay.com
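For what it's worth, here is a minimal sketch (mine, not James's code) of a write() method that emits the key as text rather than binary, along the lines of the commented-out lines above; it assumes the same output and separatorBytes fields from the LineWriter class:

    @Override
    public void write(LongWritable key, Text value) throws IOException, InterruptedException {
        // Write the key as its decimal string form instead of the raw 8-byte binary encoding.
        output.write(Long.toString(key.get()).getBytes("UTF-8"));
        output.write(separatorBytes);                        // separator passed to the constructor ("\t")
        output.write(value.toString().getBytes("UTF-8"));    // value bytes as UTF-8 text
        output.write("\n".getBytes("UTF-8"));                // one record per line
    }

With a change like this the output file should contain readable tab-separated lines instead of the Writable binary encodings of LongWritable and Text.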
Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster
Narayanan,

Regarding the client installation, you should make sure that the client and server use the same version of Hadoop for submitting jobs and transferring data. If you use a different user on the client than the one that runs the Hadoop jobs, configure the Hadoop ugi property (sorry, I forget the exact name).

On Jul 1, 2011, 15:28, Narayanan K knarayana...@gmail.com wrote:

Hi Harsh

Thanks for the quick response... Have a few clarifications regarding the 1st point:

Let me tell the background first. We have actually set up a Hadoop cluster with HBase installed. We are planning to load HBase with data, perform some computations on the data and show the data in a report format. The report should be accessible from outside the cluster, and the report accepts certain parameters to show data; those parameters will in turn be passed on to the hadoop master server, where a mapreduce job will be run that queries HBase to retrieve the data.

So the report will be run from a different machine outside the cluster. So we need a way to pass the parameters to the hadoop cluster (master) and initiate a mapreduce job dynamically. Similarly, the output of the mapreduce job needs to be tunneled back to the machine from where the report was run.

Some more clarification I need: does the machine (outside of the cluster) which runs the report require something like a client installation that will talk to the Hadoop master server via TCP? Or can it run a job on the hadoop server by using passwordless scp to the master machine, or something of the like?

Regards, Narayanan

On Fri, Jul 1, 2011 at 11:41 AM, Harsh J ha...@cloudera.com wrote:

Narayanan,

On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com wrote:

Hi all, We are basically working on a research project and I require some help regarding this.

Always glad to see research work being done! What're you working on? :)

How do I submit a mapreduce job from outside the cluster, i.e. from a different machine outside the Hadoop cluster?

If you use the Java APIs, use the Job#submit(…) method and/or the JobClient.runJob(…) method. Basically Hadoop will try to create a jar with all requisite classes within and will push it out to the JobTracker's filesystem (HDFS, if you run HDFS). From there on, it's like a regular operation. This even happens on the Hadoop nodes itself, so doing it from an external place, as long as that place has access to Hadoop's JT and HDFS, should be no different at all. If you are packing custom libraries along, don't forget to use DistributedCache. If you are packing custom MR Java code, don't forget to use Job#setJarByClass/JobClient#setJarByClass and other appropriate API methods.

If the above can be done, how can I schedule mapreduce jobs to run in hadoop like crontab from a different machine? Are there any webservice APIs that I can leverage to access a hadoop cluster from outside and submit jobs or read/write data from HDFS?

For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/ It is well supported and is very useful for writing MR workflows (which is a common requirement). You also get coordinator features and can schedule things similar to crontab.

For HDFS r/w over the web, I'm not sure of an existing web app specifically for this purpose without limitations, but there is a contrib/thriftfs you can build upon (if not writing your own webserver in Java, in which case it's as simple as using the HDFS APIs).
Also have a look at the pretty mature Hue project which aims to provide a great frontend that lets you design jobs, submit jobs, monitor jobs and upload files or browse the filesystem (among several other things): http://cloudera.github.com/hue/ -- Harsh J
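Not from the original thread, but as a rough illustration of the client-side submission Harsh describes, here is a minimal sketch of submitting a job from a machine outside the cluster via the Java API. The cluster addresses and paths are placeholders, and the identity Mapper/Reducer stand in for real job logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExternalSubmit {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder addresses: point the client at the cluster's NameNode and JobTracker.
            conf.set("fs.default.name", "hdfs://master-host:9000");
            conf.set("mapred.job.tracker", "master-host:9001");

            Job job = new Job(conf, "external-report-job");
            job.setJarByClass(ExternalSubmit.class);    // ship the jar containing the user classes
            job.setMapperClass(Mapper.class);           // identity mapper, stands in for real logic
            job.setReducerClass(Reducer.class);         // identity reducer
            job.setOutputKeyClass(LongWritable.class);  // matches what TextInputFormat + identity emit
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));

            // Submit from the client machine and wait; the client only needs
            // network access to the JobTracker and HDFS, as Harsh notes.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The jar containing the job classes still has to be on the client's classpath so setJarByClass can locate and upload it; beyond that, this is the same code you would run on a cluster node.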
Does hadoop local mode support running multiple jobs in different threads?
Hi,

I am not sure if this question (as per the title) has been asked before, but I didn't find an answer by googling. I'd like to explain the scenario of my problem: my program launches several threads at the same time, and each thread submits a hadoop job and waits for that job to complete. The unit tests were run in local mode, the mini-cluster and a real hadoop cluster. I found the unit tests may fail in local mode, but they always succeeded in the mini-cluster and the real hadoop cluster. When a unit test failed in local mode, the causes varied (stack traces are posted at the end of this mail). It seems that running multiple jobs from multiple threads is not supported in local mode, is it?

Error 1:

2011-07-01 20:24:36,460 WARN [Thread-38] mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
java.io.FileNotFoundException: File build/test/tmp/mapred/local/taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
    at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1447)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:549)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:623)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

Error 2:

2011-07-01 19:00:25,546 INFO [Thread-32] fs.FSInputChecker (FSInputChecker.java:readChecksumChunk(247)) - Found checksum error: b[3584, 4096]=696f6e69643c2f6e616d653e3c76616c75653e47302e4120636f696e636964656e63652047312e413c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a6f622e747261636b65722e706572736973742e6a6f627374617475732e6469723c2f6e616d653e3c76616c75653e2f6a6f62747261636b65722f6a6f6273496e666f3c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a61723c2f6e616d653e3c76616c75653e66696c653a2f686f6d652f70616e797a682f6861646f6f7063616c632f6275696c642f746573742f746d702f6d61707265642f73797374656d2f6a6f625f6c6f63616c5f303030332f6a6f622e6a61723c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e73332e627565722e6469723c2f6e616d653e3c76616c75653e247b6861646f6f702e746d702e6469727d2f7c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6a6f622e656e642e72657472792e617474656d7074733c2f6e616d653e3c76616c75653e303c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e66696c652e696d706c3c2f6e616d653e3c76616c75653e6f
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/home/hadoop-user/hadoop-proj/build/test/tmp/mapred/system/job_local_0003/job.xml at 3584
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:49)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:209)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
    at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.init(LocalJobRunner.java:92)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
    at hadoop.GroupingRunnable.run(GroupingRunnable.java:126)
    at java.lang.Thread.run(Thread.java:619)
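For context (my own sketch, not code from the original mail): the scenario described above amounts to something like the following, where each thread builds and submits its own job against the local runner. The thread count, job names and input/output paths are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ParallelLocalJobs {
        public static void main(String[] args) throws Exception {
            Thread[] threads = new Thread[4];
            for (int i = 0; i < threads.length; i++) {
                final int id = i;
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        try {
                            Configuration conf = new Configuration();
                            conf.set("mapred.job.tracker", "local");  // force local mode
                            Job job = new Job(conf, "parallel-job-" + id);
                            FileInputFormat.addInputPath(job, new Path("input"));
                            FileOutputFormat.setOutputPath(job, new Path("output-" + id));
                            // Each thread blocks until its own job completes.
                            job.waitForCompletion(false);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
        }
    }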
Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files
Hi,

I am using Eclipse Helios Service Release 2. I encountered a similar problem (the map/reduce perspective failed to load) when upgrading the eclipse plugin from 0.20.2 to the 0.20.3-append version. I compared the source code of the eclipse plugin and found only a few differences. I tried to revert the differences one by one to see if it would work. What surprised me was that when I only reverted the jar name from hadoop-0.20.3-eclipse-plugin.jar to hadoop-0.20.2-eclipse-plugin.jar, it worked in eclipse.

Yaozhen

On Thu, Jun 23, 2011 at 1:22 AM, praveenesh kumar praveen...@gmail.com wrote:

I am doing that.. its not working.. If I replace the hadoop-core inside hadoop-plugin.jar, I am not able to see the map-reduce perspective at all. Guys.. any help!!!

Thanks, Praveenesh

On Wed, Jun 22, 2011 at 12:34 PM, Devaraj K devara...@huawei.com wrote:

Every time hadoop builds, it also builds the hadoop eclipse plug-in using the latest hadoop core jar. In your case the eclipse plug-in contains one version of the jar and the cluster is running with another version. That's why it is giving the version mismatch error. Just replace the hadoop-core jar in your eclipse plug-in with whatever jar the hadoop cluster is using and check.

Devaraj K

From: praveenesh kumar [mailto:praveen...@gmail.com]
Sent: Wednesday, June 22, 2011 12:07 PM
To: common-user@hadoop.apache.org; devara...@huawei.com
Subject: Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

I followed Michael Noll's tutorial for building the hadoop-0.20-append jars:
http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/

After following the article we get 5 jar files, which we need to use to replace the hadoop-0.20.2 jar files. There is no jar file for the hadoop-eclipse plugin that I can see in my repository if I follow that tutorial. Also the hadoop-plugin I am using has no info on JIRA MAPREDUCE-1280 regarding whether it is compatible with hadoop-0.20-append. Has anyone else faced this kind of issue?

Thanks, Praveenesh

On Wed, Jun 22, 2011 at 11:48 AM, Devaraj K devara...@huawei.com wrote:

The Hadoop eclipse plugin also uses the hadoop-core.jar file to communicate with the hadoop cluster. For this it needs the same version of hadoop-core.jar on the client as on the server (hadoop cluster). Update the hadoop eclipse plugin for your eclipse with the one provided with the hadoop-0.20-append release; it will work fine.

Devaraj K

-----Original Message-----
From: praveenesh kumar [mailto:praveen...@gmail.com]
Sent: Wednesday, June 22, 2011 11:25 AM
To: common-user@hadoop.apache.org
Subject: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

Guys,

I was using the hadoop eclipse plugin on a hadoop 0.20.2 cluster, and it was working fine for me. I was using Eclipse SDK Helios 3.6.2 with the plugin hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar downloaded from JIRA MAPREDUCE-1280. Now for the HBase installation I had to use the hadoop-0.20-append compiled jars, and I had to replace the old jar files with the new 0.20-append compiled jar files. But now, after replacing them, my hadoop eclipse plugin is not working for me. Whenever I try to connect to my hadoop master node from it and try to see DFS locations, it gives me the following error:

Error: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch (client 41 server 43)

However, the hadoop cluster is working fine if I go directly to the hadoop namenode and use hadoop commands. I can add files to HDFS and run jobs from there; the HDFS web console and Map-Reduce web console are also working fine. But I am not able to use my previous hadoop eclipse plugin.

Any suggestions or help for this issue?

Thanks, Praveenesh
Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files
Hi,

Our hadoop version was built on 0.20-append with a few patches. However, I didn't see big differences in the eclipse-plugin.

Yaozhen

On Thu, Jun 23, 2011 at 11:29 AM, 叶达峰 (Jack Ye) kobe082...@qq.com wrote:

Do you use hadoop 0.20.203.0? I also have a problem with this plugin.

Yaozhen Pan itzhak@gmail.com wrote:

Hi, I am using Eclipse Helios Service Release 2. I encountered a similar problem (the map/reduce perspective failed to load) when upgrading the eclipse plugin from 0.20.2 to the 0.20.3-append version. I compared the source code of the eclipse plugin and found only a few differences. I tried to revert the differences one by one to see if it would work. What surprised me was that when I only reverted the jar name from hadoop-0.20.3-eclipse-plugin.jar to hadoop-0.20.2-eclipse-plugin.jar, it worked in eclipse.

Yaozhen

[snip: the remainder of the quoted thread repeats the messages above]
Re: Make reducer task exit early
It can be achieved by overriding Reducer.run() in the new mapreduce API, but I don't know how to achieve it in the old API.

On Sat, Jun 4, 2011 at 8:14 AM, Aaron Baff aaron.b...@telescope.tv wrote:

Is there a way to make a Reduce task exit early, before it has finished reading all of its data? Basically I'm doing a group-by with a sum, and I only want to return the top 1000 records, say. So I have a local class int variable to keep track of how many have currently been written to the output, and as soon as that is exceeded, I simply return at the top of the reduce() function. Is there any way to optimize it even more, to tell the Reduce task: stop reading data, I don't need any more?

--Aaron
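To make that concrete, here is a minimal sketch (mine, not from the thread; the key/value types and the 1000-record cutoff are assumptions based on Aaron's description) of overriding run() in the new API so the reducer stops pulling input groups once the limit is hit:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private static final int LIMIT = 1000;  // assumed cutoff from the question
        private int written = 0;

        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();   // group-by with a sum, as in the question
            }
            context.write(key, new LongWritable(sum));
            written++;
        }

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            // Stop consuming input groups as soon as the limit is reached,
            // instead of reading (and discarding) the rest of the data.
            while (written < LIMIT && context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);
        }
    }

Note that this only saves the cost of iterating over the remaining groups in the reduce phase; the shuffle and merge for that reducer have already completed by the time run() is called.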
Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster
How many datanodes are in your cluster, and what is the value of dfs.replication in hdfs-site.xml (if not specified, the default value is 3)? From the error log, it seems there are not enough datanodes to replicate the files in HDFS.

On May 31, 2011, 22:23, Harsh J ha...@cloudera.com wrote:

Xu,

Please post the output of `hadoop dfsadmin -report` and attach the tail of a started DN's log?

On Tue, May 31, 2011 at 7:44 PM, Xu, Richard richard...@citi.com wrote:

2. Also, Configured Cap...

This might easily be the cause. I'm not sure if it's a Solaris thing that can lead to this though.

3. in datanode server, no error in logs, but tasktracker logs has the following suspicious thing:...

I don't see any suspicious log message in what you posted. Anyhow, the TT does not matter here.

-- Harsh J
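For reference (not part of the original exchange), this is the shape of the hdfs-site.xml property being discussed; the value 1 here is only an example for a setup with a single datanode:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

Setting dfs.replication no higher than the number of live datanodes avoids the under-replication errors described above.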