Newbie InputFormat Question
I want to alter the default <"key", "line"> input format to be <"key", "line number:" + "line"> so that my mapper can have a reference to the line number. It seems like this should be easy by overriding either the InputFormat or the InputSplit, but after reading some of the docs I am still unsure of where to begin. Any help is much appreciated. -Matt -- View this message in context: http://www.nabble.com/Newbie-InputFormat-Question-tp17120981p17120981.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
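One way in (an untested sketch against the old org.apache.hadoop.mapred API; the class name and variables are mine, not from any reply): subclass TextInputFormat and wrap the RecordReader it returns, rewriting each value before the mapper sees it. Note that the default key is the byte *offset*, not a line number, and that a counter like this is per-split — for globally correct line numbers you would also need isSplitable to return false so each file is read by a single task.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Untested sketch: delegate to the stock line reader, but prepend a
// per-split line count to each value before handing it to the mapper.
public class NumberedLineInputFormat extends TextInputFormat {
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final RecordReader<LongWritable, Text> inner =
            super.getRecordReader(split, job, reporter);
        return new RecordReader<LongWritable, Text>() {
            private long lineNum = 0; // counts lines within this split only

            public boolean next(LongWritable key, Text value) throws IOException {
                if (!inner.next(key, value)) return false;
                lineNum++;
                // Rewrite the value as "line number:" + "line", as asked.
                value.set(lineNum + ":" + value.toString());
                return true;
            }
            public LongWritable createKey() { return inner.createKey(); }
            public Text createValue() { return inner.createValue(); }
            public long getPos() throws IOException { return inner.getPos(); }
            public float getProgress() throws IOException { return inner.getProgress(); }
            public void close() throws IOException { inner.close(); }
        };
    }
}
```

The mapper then only has to split the value on the first ":" to recover the line number.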
Fwd: Collecting output not to file
To clarify:

static class TestOutputFormat implements OutputFormat {
    static class TestRecordWriter implements RecordWriter {
        TestOutputFormat output;

        public TestRecordWriter(TestOutputFormat output, org.apache.hadoop.fs.FileSystem ignored,
                                JobConf job, String name, Progressable progress) {
            this.output = output;
        }

        public void close(Reporter reporter) {}

        public void write(Text key, Text value) {
            output.addResults(value.toString());
        }
    }

    protected String results = "";

    public void checkOutputSpecs(org.apache.hadoop.fs.FileSystem ignored, JobConf job)
            throws IOException {}

    public RecordWriter getRecordWriter(org.apache.hadoop.fs.FileSystem ignored, JobConf job,
                                        String name, Progressable progress) {
        return new TestRecordWriter(this, ignored, job, name, progress);
    }

    public void addResults(String r) { results += r + ","; }

    public String getResults() { return results; }
}

And then running the task:

public int run(String[] args) throws Exception {
    JobClient.runJob(job);
    // getOutputFormat creates a new instance of the output format. I want to get the
    // instance of the output format that the reduce function wrote to -- the
    // RecordWriter that reduce wrote to would be just as good.
    TestOutputFormat results = (TestOutputFormat) job.getOutputFormat();
    // Always prints the empty string, not the populated results
    System.out.println("results: " + results.getResults());
    return 0;
}

Derek Shaw <[EMAIL PROTECTED]> wrote:

Date: Tue, 6 May 2008 23:26:30 -0400 (EDT)
From: Derek Shaw <[EMAIL PROTECTED]>
Subject: Collecting output not to file
To: core-user@hadoop.apache.org

Hey,

From the examples that I have seen thus far, all of the results from the reduce function are being written to a file. Instead of writing results to a file, I want to store them and inspect them after the job is completed. (I think that I need to implement my own OutputCollector, but I don't know how to tell hadoop to use it.) How can I do this?

-Derek
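For what it's worth, the empty string here is expected: the reduce tasks run in separate child JVMs (often on other machines), so any state a TestOutputFormat instance accumulates there never reaches the submitting process, and job.getOutputFormat() hands back a fresh, empty instance. The usual alternative is to read the part-* files back from the job's output directory once runJob() returns. An untested sketch (method names are from the 0.16-era API; the part-file name is the framework's default, and the helper name is mine):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Untested sketch: run the job, then read its output back from HDFS
// instead of trying to recover in-memory state from the reduce side.
public String runAndCollect(JobConf job) throws IOException {
    JobClient.runJob(job); // blocks until the job completes
    FileSystem fs = FileSystem.get(job);
    StringBuilder results = new StringBuilder();
    // One "part-NNNNN" file per reduce task; loop over all of them
    // if numReduceTasks > 1.
    Path part = new Path(job.getOutputPath(), "part-00000");
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part)));
    String line;
    while ((line = in.readLine()) != null) {
        results.append(line).append(",");
    }
    in.close();
    return results.toString();
}
```

From there it is easy to populate a multimap or any other in-memory structure for inspection.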
Re: Hadoop Permission Problem
Hi Senthil, Since the path "myapps" is relative, copyFromLocal will copy the file to the home directory, i.e. /user/Test/myapps in your case. If /user/Test doesn't exist, it will first try to create it. You got the AccessControlException because the permission of /user is 755, so only its owner "hadoop" can create directories under it. Hope this helps. Nicholas

----- Original Message -----
From: "Natarajan, Senthil" <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Wednesday, May 7, 2008 2:36:22 PM
Subject: Hadoop Permission Problem

Hi, My datanode and jobtracker are started by user "hadoop", and user "Test" needs to submit a job. So if the user "Test" copies a file to HDFS, there is a permission error. /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/Test/somefile.txt myapps copyFromLocal: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=Test, access=WRITE, inode="user":hadoop:supergroup:rwxr-xr-x Could you please let me know how users other than "hadoop" can access HDFS and then submit MapReduce jobs? Where to configure, or what default configuration needs to be changed? Thanks, Senthil
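The usual fix (a sketch, assuming the cluster superuser is "hadoop" and permissions are enabled as in the 0.16-era `hadoop dfs` shell) is to pre-create a home directory for each submitting user and hand ownership over with chown:

```shell
# Run these as the "hadoop" superuser on a cluster node.
# Creates /user/Test and makes "Test" its owner, so Test can write
# relative paths like "myapps" without needing write access to /user.
bin/hadoop dfs -mkdir /user/Test
bin/hadoop dfs -chown Test:supergroup /user/Test
```

After that, user "Test" can run copyFromLocal and submit jobs without touching the permissions on /user itself.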
Re: Read timed out, Abandoning block blk_-5476242061384228962
Hi James, Were you able to start all the nodes in the same 'availability zone'? Are you using the new AMI kernels? If you are using the contrib/ec2 scripts, you might upgrade (just the scripts) to http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/ These support the new kernels and availability zones. My transient errors went away when upgrading. The functional changes are documented here: http://wiki.apache.org/hadoop/AmazonEC2 FYI, you will need to build your own images (via the create-image command) with whatever version of Hadoop you are comfortable with. This will also get you a Ganglia install... ckw On May 7, 2008, at 1:29 PM, James Moore wrote: What is this bit of the log trying to tell me, and what sorts of things should I be looking at to make sure it doesn't happen? I don't think the network has any basic configuration issues - I can telnet from the machine creating this log to the destination - telnet 10.252.222.239 50010 works fine when I ssh in to the box with this error. 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-5476242061384228962 2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.222.239:50010 I'm seeing a fair number of these. My reduces finally complete, but there are usually a couple at the end that take longer than I think they should, and they frequently have these sorts of errors. I'm running 20 machines on ec2 right now, with hadoop version 0.16.4. -- James Moore | [EMAIL PROTECTED] blog.restphone.com Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Read timed out, Abandoning block blk_-5476242061384228962
Taking the timeout out is very dangerous. It may cause your application to hang. You could change the timeout parameter to a larger number. HADOOP-2188 fixed the problem. Check https://issues.apache.org/jira/browse/HADOOP-2188. Hairong On 5/7/08 2:36 PM, "James Moore" <[EMAIL PROTECTED]> wrote: > I noticed that there was a hard-coded timeout value of 6000 (ms) in > src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I > took that way down and now I'm not noticing the problem. (Doesn't > mean it's not there, I just don't feel the pain...) > > This feels like a terrible solution^H^H^H^H^H^hack though, > particularly since I haven't yet taken the time to actually understand > the code.
Hadoop Permission Problem
Hi, My datanode and jobtracker are started by user "hadoop", and user "Test" needs to submit a job. So if the user "Test" copies a file to HDFS, there is a permission error. /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/Test/somefile.txt myapps copyFromLocal: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=Test, access=WRITE, inode="user":hadoop:supergroup:rwxr-xr-x Could you please let me know how users other than "hadoop" can access HDFS and then submit MapReduce jobs? Where to configure, or what default configuration needs to be changed? Thanks, Senthil
Re: Read timed out, Abandoning block blk_-5476242061384228962
I noticed that there was a hard-coded timeout value of 6000 (ms) in src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I took that way down and now I'm not noticing the problem. (Doesn't mean it's not there, I just don't feel the pain...) This feels like a terrible solution^H^H^H^H^H^hack though, particularly since I haven't yet taken the time to actually understand the code. -- James Moore | [EMAIL PROTECTED] blog.restphone.com
Read timed out, Abandoning block blk_-5476242061384228962
What is this bit of the log trying to tell me, and what sorts of things should I be looking at to make sure it doesn't happen? I don't think the network has any basic configuration issues - I can telnet from the machine creating this log to the destination - telnet 10.252.222.239 50010 works fine when I ssh in to the box with this error. 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out 2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-5476242061384228962 2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.222.239:50010 I'm seeing a fair number of these. My reduces finally complete, but there are usually a couple at the end that take longer than I think they should, and they frequently have these sorts of errors. I'm running 20 machines on ec2 right now, with hadoop version 0.16.4. -- James Moore | [EMAIL PROTECTED] blog.restphone.com
Reduce task is stalled. Just wont execute
Hi, I have been trying the word count example distributed with Hadoop 0.16.3. It works fine in single-machine mode, but the moment I add an extra slave the reduce phase stalls. I get the following messages in my slave's TaskTracker log (the same line then repeats every few seconds for minutes on end):

2008-05-07 23:37:27,860 INFO org.apache.hadoop.mapred.TaskTracker: task_200805080929_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.03 MB/s)
2008-05-07 23:37:33,862 INFO org.apache.hadoop.mapred.TaskTracker: task_200805080929_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.03 MB/s)
[... the identical "0.1667% reduce > copy (1 of 2 at 0.03 MB/s)" line repeats roughly every 3-6 seconds through 23:40:24 ...]
2008-05-07 23:40:27,120 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: task_200805080929_0001_m_01_1
2008-05-07 23:40:28,705 INFO org.apache.hadoop.mapred.TaskTracker: task_200805080929_0001_m_01_1 1.0% hdfs://master:54310/user/hadoop/d3:337381+337381
2008-05-07 23:40:28,708 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200805080929_0001_m_01_1 is done.
2008-05-07 23:40:30,923 INFO org.apache.hadoop.mapred.TaskTracker: task_200805080929_0001_r_00_0 0.1
Re: Where is the files?
DFS files are split into blocks. Blocks are stored under dfs.data.dir/current. Hairong On 5/7/08 7:36 AM, "hong" <[EMAIL PROTECTED]> wrote: > Hi All, > > I started Hadoop in standalone mode, and put some files onto HDFS. I > strictly followed the instructions in the Hadoop Quick Start. > > HDFS is mapped to a local directory in my local file system, right? > And where is it? > > Thank you in advance! > >
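Concretely, for a Quick Start setup that leaves dfs.data.dir at its default, the blocks usually land under hadoop.tmp.dir; the path below assumes the 0.16-era default of /tmp/hadoop-${user.name} and may differ on your machine:

```shell
# Where the datanode keeps block data with default settings.
# dfs.data.dir defaults to ${hadoop.tmp.dir}/dfs/data, and
# hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}.
ls /tmp/hadoop-$USER/dfs/data/current
# Expect files named blk_<id> (the raw block bytes) and blk_<id>.meta
# (checksums) -- not your original file names or directory layout.
```

The directory structure of the HDFS namespace itself lives only in the namenode's image and edit log, not as directories on the local disk.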
Re: Not allow file split
On May 7, 2008, at 6:30 AM, Roberto Zandonati wrote: Hi all, I'm a newbie and I have the following problem. I need to implement an InputFormat whose isSplitable always returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question no. 10). And here is the problem: I also have to implement the RecordReader interface to return the whole content of the input file, but I don't know how. I have found only examples that use the LineRecordReader. Couple of things. 1. Take a look at SequenceFileRecordReader: http://svn.apache.org/viewvc/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java?view=log 2. If you just want to process a text file as a whole or a sequence file as a whole (or any existing one) you do not need to implement a RecordReader at all. Just subclass the InputFormat and override isSplitable, and the RecordReader will work correctly. Take a look at SortValidator (http://svn.apache.org/viewvc/hadoop/core/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java) and how it subclasses SequenceFileInputFormat to implement a NonSplittableSequenceFileInputFormat. Arun
Re: Where is the files?
It will be mapped to /tmp (the equivalent of /tmp on Windows). Regards, -Vikas. On Wed, May 7, 2008 at 8:06 PM, hong <[EMAIL PROTECTED]> wrote: > Hi All, > > I started Hadoop in standalone mode, and put some files onto HDFS. I > strictly followed the instructions in the Hadoop Quick Start. > > HDFS is mapped to a local directory in my local file system, right? And > where is it? > > Thank you in advance! > > >
Where is the files?
Hi All, I started Hadoop in standalone mode, and put some files onto HDFS. I strictly followed the instructions in the Hadoop Quick Start. HDFS is mapped to a local directory in my local file system, right? And where is it? Thank you in advance!
Re: Not allow file split
You can implement a custom input format and a record reader. Assuming your record data type is class RecType, the input format should subclass FileInputFormat< LongWritable, RecType > and the record reader should implement RecordReader < LongWritable, RecType > In this case the key could be the offset into the file, although it is not very useful since you treat the entire file as one record. The isSplitable() method in the input format should return false. The RecordReader.next( LongWritable pos, RecType val ) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record. -Rahul Sood [EMAIL PROTECTED] > Hi at all, I'm a newbie and I have the following problem. > > I need to implement an InputFormat such that the isSplitable always > returns false ah shown in http://wiki.apache.org/hadoop/FAQ (question > no 10). > And here there is the problem. > > I have also to implement the RecordReader interface for returning the > whole content of the input file but I don't know how. I have found > only examples that uses the LineRecordReader > > Someone can help me? > > Thanks >
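Putting the recipe above together, an untested sketch against the old org.apache.hadoop.mapred API (class name is mine; here the value is a Text holding the whole file rather than a custom RecType, which keeps the example short):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch: never split files, and deliver each file's entire
// contents to one map task as a single (offset, contents) record.
public class WholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one map task per file, per FAQ question 10
    }

    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final FileSplit fileSplit = (FileSplit) split;
        final FileSystem fs = fileSplit.getPath().getFileSystem(job);
        return new RecordReader<LongWritable, Text>() {
            private boolean done = false;

            public boolean next(LongWritable key, Text value) throws IOException {
                if (done) return false;
                done = true;
                byte[] contents = new byte[(int) fileSplit.getLength()];
                FSDataInputStream in = fs.open(fileSplit.getPath());
                in.readFully(0, contents);
                in.close();
                key.set(fileSplit.getStart()); // 0, since the file is unsplit
                value.set(contents, 0, contents.length);
                return true;
            }
            public LongWritable createKey() { return new LongWritable(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return done ? 1 : 0; }
            public float getProgress() { return done ? 1.0f : 0.0f; }
            public void close() {}
        };
    }
}
```

One caveat worth stating: this buffers the whole file in memory, so it is only sensible for files comfortably smaller than the task heap.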
Re: Collecting output not to file
Good point. I want to put the results of the reduce function in a multimap instead of writing them to a file. -Derek

Amar Kamat <[EMAIL PROTECTED]> wrote:
Derek Shaw wrote:
> Hey,
> From the examples that I have seen thus far, all of the results from the reduce function are being written to a file. Instead of writing results to a file, I want to store them
What do you mean by "store and inspect"?
> and inspect them after the job is completed. (I think that I need to implement my own OutputCollector, but I don't know how to tell hadoop to use it.) How can I do this?
> -Derek
Re: single node Hbase
Try this one http://hadoop.apache.org/hbase/docs/r0.1.1/api/overview-summary.html#overview_description - Yuri. On Wed, May 7, 2008 at 4:40 PM, Ahmed Shiraz Memon < [EMAIL PROTECTED]> wrote: > the link is not working... > Shiraz > > On Mon, Mar 17, 2008 at 9:34 PM, stack <[EMAIL PROTECTED]> wrote: > > > Try our 'getting started': > > http://hadoop.apache.org/hbase/docs/current/api/index.html. > > St.Ack > > > > > > > > Peter W. wrote: > > > > > Hello, > > > > > > Are there any Hadoop documentation resources showing > > > how to run the current version of Hbase on a single node? > > > > > > Thanks, > > > > > > Peter W. > > > > > > > >
Not allow file split
Hi all, I'm a newbie and I have the following problem. I need to implement an InputFormat whose isSplitable always returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question no. 10). And here is the problem: I also have to implement the RecordReader interface to return the whole content of the input file, but I don't know how. I have found only examples that use the LineRecordReader. Can someone help me? Thanks -- Roberto Zandonati
Re: single node Hbase
the link is not working... Shiraz On Mon, Mar 17, 2008 at 9:34 PM, stack <[EMAIL PROTECTED]> wrote: > Try our 'getting started': > http://hadoop.apache.org/hbase/docs/current/api/index.html. > St.Ack > > > > Peter W. wrote: > > > Hello, > > > > Are there any Hadoop documentation resources showing > > how to run the current version of Hbase on a single node? > > > > Thanks, > > > > Peter W. > > > >
Re: Collecting output not to file
Derek Shaw wrote:
> Hey,
> From the examples that I have seen thus far, all of the results from the reduce function are being written to a file. Instead of writing results to a file, I want to store them

What do you mean by "store and inspect"?

> and inspect them after the job is completed. (I think that I need to implement my own OutputCollector, but I don't know how to tell hadoop to use it.) How can I do this?
> -Derek
Re: Collecting output not to file
> (I think that I need to implement my own OutputCollector, but I don't know how to tell hadoop to use it.) How can I do this? -Derek

You probably need to define your own OutputFormat and tell Hadoop to use it by calling the setOutputFormat method of JobConf. The OutputFormat instance is used to create the RecordWriter instance, which is what the OutputCollector hands output data to. You may want to take a look at the implementation of SequenceFileOutputFormat for an example.
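The wiring itself is a one-liner on the job configuration. A sketch (assumes a TestOutputFormat class like the one posted elsewhere in this thread; the driver class name is made up):

```java
// Sketch: plugging a custom OutputFormat into a job (old mapred API).
// MyDriver stands in for your job's driver class.
JobConf job = new JobConf(MyDriver.class);
// The framework instantiates TestOutputFormat itself, calls
// getRecordWriter() once per reduce task, and routes every
// OutputCollector.collect() call through that RecordWriter.
job.setOutputFormat(TestOutputFormat.class);
```

Note that because the framework creates its own instances on the task side, any state the RecordWriter accumulates stays in the task's JVM; see the rest of the thread for why that matters.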
Re: How to write simple programs using Hadoop?
On May 7, 2008, at 12:33 AM, Hadoop wrote: Is there any chance to see some simple programs for Hadoop (such as Hello world, counting numbers 1-10, reading two numbers and printing the larger one, other number, string and file processing examples, etc.) written in Java/C++? It seems that the only publicly available code on the Internet is the WordCount program. I learn programming easily and faster by examples, and I would appreciate it if anyone can share some simple programs written in Java/C++ for Hadoop. If there are any manuals, examples, or links about writing programs for Hadoop, please share them. Take a look at the src/examples directory in your Hadoop distribution: http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/ and http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/pipes/impl/ Map-Reduce tutorial: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html Hadoop Streaming: http://hadoop.apache.org/core/docs/current/streaming.html Arun
How to write simple programs using Hadoop?
Is there any chance to see some simple programs for Hadoop (such as Hello world, counting numbers 1-10, reading two numbers and printing the larger one, other number, string and file processing examples, etc.) written in Java/C++? It seems that the only publicly available code on the Internet is the WordCount program. I learn programming easily and faster by examples, and I would appreciate it if anyone can share some simple programs written in Java/C++ for Hadoop. If there are any manuals, examples, or links about writing programs for Hadoop, please share them. -- View this message in context: http://www.nabble.com/How-to-write-simple-programs-using-Hadoop--tp17099073p17099073.html Sent from the Hadoop core-user mailing list archive at Nabble.com.