There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
I am trying to copy some data with distcp and I get the error: There are 2 datanode(s) running and 2 node(s) are excluded in this operation. I did not exclude any node, I have lots of space, and HDFS is not in safe mode. The command that I use is /home/ubuntu/Programs/hadoop/bin/hadoop distcp hdfs://host1:9000/wiki hdfs://host2:9000/wiki Here is the hdfs-site.xml of host1 and host2:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/data/dfs/name/</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/data/dfs/data/</value>
  </property>
</configuration>
What is wrong? -- Best regards,
How run Aggregator wordcount?
Does aggregatewordcount accept multiple folders as input? E.g. bin/hadoop jar hadoop-*-examples.jar aggregatewordcount inputfolder1 inputfolder2 inputfolder3 outfolder1 -- Best regards,
launch aggregatewordcount and sudoku in Yarn
How do I run aggregatewordcount and sudoku in Yarn? Do I need any input files, particularly for sudoku? -- Best regards,
I just want the last 4 jobs in the job history in Yarn?
Is it possible to say that I just want the last 4 jobs in the job history in Yarn? -- Best regards,
Re: mapred queue -list
What does it mean that max-capacity can be configured to be greater than capacity? If max-capacity is greater than capacity, doesn't that overload the queue? On 14 June 2013 22:16, Arun C Murthy a...@hortonworks.com wrote: Capacity is 'guaranteed' capacity, while max-capacity can be configured to be greater than capacity. Arun On Jun 13, 2013, at 5:28 AM, Pedro Sá da Costa wrote: When I launch the command mapred queue -list I get this output: Scheduling Info : Capacity: 100.0, MaximumCapacity: 1.0, CurrentCapacity: 0.0 What is the difference between the Capacity and MaximumCapacity fields? -- Best regards, -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ -- Best regards,
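In the CapacityScheduler, capacity is the share a queue is guaranteed, while maximum-capacity is the ceiling it may grow to by temporarily borrowing idle resources from other queues, so setting maximum-capacity above capacity gives elasticity rather than overload. A minimal capacity-scheduler.xml sketch, assuming the Hadoop 2 CapacityScheduler and a purely hypothetical queue named "research":

<property>
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>30</value>   <!-- guaranteed 30% of the cluster -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.maximum-capacity</name>
  <value>60</value>   <!-- may grow to 60% while other queues are idle -->
</property>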
Re: Get the history info in Yarn
And how can I get these values using the job id in Java? On 13 June 2013 08:15, Devaraj k devara...@huawei.com wrote: As per my understanding, as of now start and end times are not available through a shell command. You can use the JobClient API to get the same. Thanks & Regards Devaraj *From:* Pedro Sá da Costa [mailto:psdc1...@gmail.com] *Sent:* 13 June 2013 11:37 *To:* mapreduce-user; Devaraj k *Subject:* Re: Get the history info in Yarn But this command doesn't tell me the job duration, or the job start time and end time. How can I get this info? On 13 June 2013 07:41, Devaraj K devara...@huawei.com wrote: Hi, You can get all the details for a Job using this mapred command: mapred job -status Job-ID For this you need to have the Job History Server running and the same job history server address configured on the client side. Thanks & Regards Devaraj K *From:* Pedro Sá da Costa [mailto:psdc1...@gmail.com] *Sent:* Thursday, June 13, 2013 10:52 AM *To:* mapreduce-user *Subject:* Get the history info in Yarn I tried the command mapred job -list all to get the history of the completed jobs, but the output doesn't have the time a job started and ended, the number of maps and reduces, or the size of data read and written. Can I get this info with a shell command? I am using Yarn. -- Best regards, -- Best regards, -- Best regards,
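In Java the same information is reachable through the MRv2 client API. A minimal sketch, assuming a Hadoop 2.x classpath, a reachable Job History Server, and a placeholder job id passed as the first argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.JobStatus;

public class JobTimes {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up yarn/mapred config on the classpath
    Cluster cluster = new Cluster(conf);
    Job job = cluster.getJob(JobID.forName(args[0]));  // e.g. job_1370000000000_0001; null if unknown
    JobStatus status = job.getStatus();
    System.out.println("start:  " + status.getStartTime());
    System.out.println("finish: " + status.getFinishTime());
    System.out.println("duration (ms): " + (status.getFinishTime() - status.getStartTime()));
  }
}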
mapred queue -list
When I launch the command mapred queue -list I have this output: Scheduling Info : Capacity: 100.0, MaximumCapacity: 1.0, CurrentCapacity: 0.0 What is the difference between Capacity and MaximumCapacity fields? -- Best regards,
HDFS metrics
I am using Yarn. 1 - I want to know the average IO throughput of HDFS (i.e. how fast the datanodes are writing to disk) so that I can compare between 2 HDFS instances. The command hdfs dfsadmin -report doesn't give me that. Does HDFS have a command for that? 2 - Is there a similar way to know how fast data is being transferred between maps and reduces? -- Best regards,
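There is no single hdfs command that reports write throughput, but the MapReduce distribution ships a TestDFSIO benchmark that measures and prints the average I/O rate. A sketch, assuming a Hadoop 2.x layout where the benchmark lives in the jobclient tests jar (the exact jar name varies by version):

# write 10 files of 1000 MB each and report throughput
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
# then measure reads over the same files
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

Running the same benchmark against both clusters gives comparable numbers; shuffle transfer rates are only visible indirectly, e.g. through a job's reduce shuffle counters.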
Get the history info in Yarn
I tried the command mapred job -list all to get the history of the completed jobs, but the output doesn't have the time a job started and ended, the number of maps and reduces, or the size of data read and written. Can I get this info with a shell command? I am using Yarn. -- Best regards,
delete the job history saved in the Job History Server in Yarn
I want to delete the job history saved in the Job History Server in Yarn. How do I do that? -- Best regards,
How can I sort a file with pairs Key Value in reverse order?
I created a MapReduce job example that uses the sort mechanism of hadoop to sort a file by the key in ascending order. This is an example of the data: 7 vim 2 emacs 9 firefox At the end, I get the result: 2 emacs 7 vim 9 firefox Now I want to sort in reverse order, so that the result is: 9 firefox 7 vim 2 emacs How can I sort a file with Key Value pairs in reverse order? -- Best regards,
Re: How can I sort a file with pairs Key Value in reverse order?
Even with your answer I can't see how I can sort the data in reverse order. I forgot to mention that the output result is produced by one reduce task. This means that, at any point of the execution of the job, the data must be grouped and sorted in descending order. On 11 June 2013 13:57, Bhasker Allene allene.bhas...@gmail.com wrote: One way to approach this is to emit Integer.MAX_VALUE - your key as the output of the mapper. Example Mapper input 7 vim 2 emacs 9 firefox Mapper output (Integer.MAX_VALUE - 7) vim (Integer.MAX_VALUE - 2) emacs (Integer.MAX_VALUE - 9) firefox If you need secondary sorting on the second part, you have to use a composite key and write your own partitioner and comparator. Regards, Bhasker On 11/06/2013 11:10, Pedro Sá da Costa wrote: I created a MapReduce job example that uses the sort mechanism of hadoop to sort a file by the key in ascending order. This is an example of the data: 7 vim 2 emacs 9 firefox At the end, I get the result: 2 emacs 7 vim 9 firefox Now I want to sort in reverse order, so that the result is: 9 firefox 7 vim 2 emacs How can I sort a file with Key Value pairs in reverse order? -- Best regards, -- Thanks Regards, Bhasker Allene -- Best regards,
Re: How can I sort a file with pairs Key Value in reverse order?
Thanks for your help. Now I get it. On 11 June 2013 14:21, Bhasker Allene allene.bhas...@gmail.com wrote: Mapper input 7 vim 2 emacs 9 firefox Mapper output (new key = Integer.MAX_VALUE - key value) 2147483640 vim 2147483645 emacs 2147483638 firefox Note: Integer.MAX_VALUE is 2147483647 (which is 2^31 - 1). Hadoop will sort the records for you. If you are using a single reducer, the reducer input would be 2147483638 firefox 2147483640 vim 2147483645 emacs Reducer output (this time subtract the key from Integer.MAX_VALUE to get back the original value) 9 firefox 7 vim 2 emacs On 11/06/2013 13:05, Pedro Sá da Costa wrote: Even with your answer I can't see how I can sort the data in reverse order. I forgot to mention that the output result is produced by one reduce task. This means that, at any point of the execution of the job, the data must be grouped and sorted in descending order. On 11 June 2013 13:57, Bhasker Allene allene.bhas...@gmail.com wrote: One way to approach this is to emit Integer.MAX_VALUE - your key as the output of the mapper. Example Mapper input 7 vim 2 emacs 9 firefox Mapper output (Integer.MAX_VALUE - 7) vim (Integer.MAX_VALUE - 2) emacs (Integer.MAX_VALUE - 9) firefox If you need secondary sorting on the second part, you have to use a composite key and write your own partitioner and comparator. Regards, Bhasker On 11/06/2013 11:10, Pedro Sá da Costa wrote: I created a MapReduce job example that uses the sort mechanism of hadoop to sort a file by the key in ascending order. This is an example of the data: 7 vim 2 emacs 9 firefox At the end, I get the result: 2 emacs 7 vim 9 firefox Now I want to sort in reverse order, so that the result is: 9 firefox 7 vim 2 emacs How can I sort a file with Key Value pairs in reverse order? -- Best regards, -- Thanks Regards, Bhasker Allene -- Best regards, -- Thanks Regards, Bhasker Allene -- Best regards,
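An alternative to the MAX_VALUE subtraction trick is to keep the original keys and plug in a descending sort comparator. A minimal sketch, assuming the key is an IntWritable and the old org.apache.hadoop.mapred API used in the example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Comparator that inverts the natural IntWritable order.
public class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
        super(IntWritable.class, true);   // true -> instantiate keys for comparison
    }
    @Override
    @SuppressWarnings("unchecked")
    public int compare(WritableComparable a, WritableComparable b) {
        return -a.compareTo(b);           // flip the sign to sort descending
    }
}

// In the job driver:
// conf.setOutputKeyComparatorClass(DescendingIntComparator.class);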
replace separator in output.collect?
output.collect(key, value) writes the key and the value separated by \t. Is there a way to replace it with ':'? -- Best regards,
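The separator is not decided by collect() but by TextOutputFormat, which reads it from a configuration property. A sketch, assuming the old-API property name used by Hadoop 1.x (the newer API reads mapreduce.output.textoutputformat.separator instead):

// in the job driver, before submitting (old mapred API)
JobConf conf = new JobConf(MyJob.class);
conf.set("mapred.textoutputformat.separator", ":");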
split big files into small ones to later copy
I have one 500GB plain-text file in HDFS, and I want to copy it locally, zip it, and put it on a local disk of another machine. The problem is that the local disk where HDFS is doesn't have enough space to hold the copy and then zip it before transferring it to the other host. Can I split the file into smaller files so that they fit on the local disk? Any suggestions on how to do the copy? -- Best regards,
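One option is to stream the file out of HDFS and compress or split it on the fly, so the full uncompressed copy never has to sit on the local disk. A sketch, with placeholder paths and hostnames:

# compress while streaming straight to the remote machine, no local copy at all
hadoop fs -cat /wiki/bigfile.txt | gzip -c | ssh user@otherhost 'cat > /data/bigfile.txt.gz'

# or split into 10 GB compressed pieces locally and copy them one by one
# (the pieces must be concatenated back together before gunzip)
hadoop fs -cat /wiki/bigfile.txt | gzip -c | split -b 10G - bigfile.gz.part-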
Count lines example
I am trying to create a mapreduce example that adds the values of equal keys. E.g. the input A 1 A 2 B 4 gives the output A 3 B 4 The problem is that I cannot make the program read 2 inputs. How do I do that? Here is my example:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This is an example Hadoop Map/Reduce application.
 * It takes several outputs of the count-lines job and sums them together according to the line.
 *
 * To run: bin/hadoop jar build/countlinesaggregator.jar
 *   [-m <i>maps</i>] [-r <i>reduces</i>] <i>in-dirs</i> <i>out-dir</i>
 * e.g.
 * bin/hadoop jar countlinesaggregator.jar /gutenberg-output1 /gutenberg-output2 /final-output
 */
public class CountLinesAggregator extends Configured implements Tool {

  /**
   * Aggregate keys and values.
   * For each line of input, break the line into words and emit them as
   * (<b>lines</b>, <b>val</b>).
   */
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line, "\n");
      while (itr.hasMoreTokens()) {
        String token = itr.nextToken();
        if (token.length() > 0) {
          System.out.println("Token: " + token);
          String[] splits = token.split("\t");
          if (splits[0] != null && splits[1] != null
              && splits[0].length() > 0 && splits[1].length() > 0) {
            System.out.println(Arrays.deepToString(splits));
            String k = splits[0];
            String v = splits[1];
            word.set(k);
            IntWritable val = new IntWritable(Integer.valueOf(v));
            output.collect(word, val);
          }
        }
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  static int printUsage() {
    System.out.println("countlinesaggregator [-m <maps>] [-r <reduces>] <input1> <input2> <output>");
    ToolRunner.printGenericCommandUsage(System.out);
    return -1;
  }

  /**
   * The main driver for the word count map/reduce program.
   * Invoke this method to submit the map/reduce job.
   * @throws IOException When there are communication problems with the job tracker.
   */
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), CountLinesAggregator.class);
    conf.setJobName("countlinesaggregator");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setNumReduceTasks(1);

    List<String> other_args = new ArrayList<String>();
    for (int i = 0; i < args.length; ++i) {
      try {
        if ("-m".equals(args[i])) {
          conf.setNumMapTasks(Integer.parseInt(args[++i]));
        } else if ("-r".equals(args[i])) {
          conf.setNumReduceTasks(Integer.parseInt(args[++i]));
        } else {
          other_args.add(args[i]);
        }
      } catch (NumberFormatException
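The quoted run() method is cut off before the paths are configured, and that missing part is exactly where multiple inputs would be wired in. A minimal sketch of how a standard old-API driver typically finishes, assuming the remaining arguments are input1 input2 ... output (this is not the author's original code, which was truncated):

// ... after parsing the -m/-r options into other_args ...
if (other_args.size() < 2) {
    return printUsage();   // need at least one input and one output
}
// every argument except the last one is an input directory
for (int i = 0; i < other_args.size() - 1; ++i) {
    FileInputFormat.addInputPath(conf, new Path(other_args.get(i)));
}
FileOutputFormat.setOutputPath(conf, new Path(other_args.get(other_args.size() - 1)));
JobClient.runJob(conf);
return 0;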
Re: Count lines example
I made a mistake in my example. Given 2 files with the same content: file 1 | file 2 A 3 | A 3 B 4 | B 4 gives the output A 6 B 8 On 5 June 2013 21:08, Pedro Sá da Costa psdc1...@gmail.com wrote: I am trying to create a mapreduce example that add values of same keys. E.g. the input A 1 A 2 B 4 get the output A 3 B4 The problem is that I cannot make the program read 2 inputs. How I do that? Here is my example: package org.apache.hadoop.examples; import java.io.IOException; import java.util.ArrayList; import java.util.Arrays; import java.util.Iterator; import java.util.List; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.FileOutputFormat; import org.apache.hadoop.mapred.JobClient; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reducer; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; /** * This is an example Hadoop Map/Reduce application. * It takes in several outputs of the count lines and sum them together acordinc the line. * * To run: bin/hadoop jar build/countlinesaggregator.jar *[-m imaps/i] [-r ireduces/i] iin-dirs/i iout-dir/i * e.g. * bin/hadoop jar countlinesaggregator.jar /gutenberg-output1 /gutenberg-output2 /final-output */ public class CountLinesAggregator extends Configured implements Tool { /** * Aggregate keys and values. * For each line of input, break the line into words and emit them as * (blines/b, bval/b). */ public static class MapClass extends MapReduceBase implements MapperLongWritable, Text, Text, IntWritable { private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollectorText, IntWritable output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line, \n); while (itr.hasMoreTokens()) { String token = itr.nextToken(); if(token.length() 0 ) { System.out.println(Token: + token); String[] splits = token.split(\t); if(splits[0] != null splits[1] != null splits[0].length() 0 splits[1].length() 0) { System.out.println(Arrays.deepToString(splits)); String k = splits[0]; String v = splits[1]; word.set(k); IntWritable val = new IntWritable(Integer.valueOf(v)); output.collect(word, val); } } } } } /** * A reducer class that just emits the sum of the input values. */ public static class Reduce extends MapReduceBase implements ReducerText, IntWritable, Text, IntWritable { public void reduce(Text key, IteratorIntWritable values, OutputCollectorText, IntWritable output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } static int printUsage() { System.out.println(countlinesaggregator [-m maps] [-r reduces] input1 input2 output); ToolRunner.printGenericCommandUsage(System.out); return -1; } /** * The main driver for word count map/reduce program. * Invoke this method to submit the map/reduce job. * @throws IOException When there is communication problems with the * job tracker. 
*/ public int run(String[] args) throws Exception { JobConf conf = new JobConf(getConf(), CountLinesAggregator.class); conf.setJobName(countlinesaggregator); // the keys are words (strings) conf.setOutputKeyClass(Text.class); // the values are counts (ints) conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setNumReduceTasks(1); ListString other_args = new ArrayListString(); for(int i=0; i args.length; ++i) { try { if (-m.equals
Print logs in MapReduce example
Hi, I created my mapreduce example for hadoop 2.0.4; how do I print the logs in the console output? System.out.println(), Logger.getRootLogger(), and Logger.getLogger(MyClass.class) don't print anything. Here is my code.

public class WordCountAggregator extends Configured implements Tool {

  public static Logger LOG = Logger.getLogger(WordCountAggregator.class);
  public static Logger LOG2 = Logger.getRootLogger();

  /**
   * Counts the words in each line.
   * For each line of input, break the line into words and emit them as
   * (<b>word</b>, <b>1</b>).
   */
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      LOG.setLevel(Level.INFO);
      LOG.addAppender(new ConsoleAppender());
      String line = value.toString();
      System.out.println("LL" + line);
      LOG.debug("LL" + line);
      LOG2.debug("LL" + line);
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        String l = itr.nextToken();
        LOG.info("ABC" + l);
        String[] splits = l.split(" ");
        word.set(splits[0]);
        output.collect(word, new IntWritable(Integer.valueOf(splits[1])));
      }
    }
  }

  /**
   * A reducer class that just emits the sum of the input values.
   */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
} -- Best regards,
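The map and reduce tasks run in separate JVMs on the cluster nodes, so their System.out and log4j output goes to the per-task logs (stdout/syslog), not to the console of the client that submitted the job. A sketch of where to look, assuming YARN log aggregation is enabled (yarn.log-aggregation-enable) and using a placeholder application id:

# per-task logs through the web UI: ResourceManager -> application -> logs
# or from the shell after the job finishes:
yarn logs -applicationId application_1370000000000_0001 | less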
set HTTPFS in Hadoop 2.0.4
Hi, I set the HttpFS of Hadoop 2.0.4 to run on port 3888. Now I want to access the filesystem but I can't do it. Here's the URL that I am using, and the config files. How can I fix this?

http://host:3888/webhdfs/v1/user/myuser?user.name=myuser&op=list
{"RemoteException":{"message":"java.lang.IllegalArgumentException: No enum const class org.apache.hadoop.fs.http.client.HttpFSFileSystem$Operation.LIST","exception":"QueryParamException","javaClassName":"com.sun.jersey.api.ParamException$QueryParamException"}}

$ netstat -plnet
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 0.0.0.0:3888 0.0.0.0:* LISTEN 78250 109481130 1580/java

The HTTPFS server is running.

$ cat etc/hadoop/httpfs-env.sh
#!/bin/bash
# Set httpfs specific environment variables here.
# Settings for the Embedded Tomcat that runs HttpFS
# Java System properties for HttpFS should be specified in this variable
# export CATALINA_OPTS=
# HttpFS logs directory
# export HTTPFS_LOG=${HTTPFS_HOME}/logs
# HttpFS temporary directory
# export HTTPFS_TEMP=${HTTPFS_HOME}/temp
# The HTTP port used by HttpFS
export HTTPFS_HTTP_PORT=3888
# The Admin port used by HttpFS
# export HTTPFS_ADMIN_PORT=`expr ${HTTPFS_HTTP_PORT} + 1`
# The hostname HttpFS server runs on
export HTTPFS_HTTP_HOSTNAME=`hostname -f`

$ cat etc/hadoop/httpfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>httpfs.proxyuser.myuser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>httpfs.proxyuser.myuser.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>httpfs.authentication.type</name>
    <value>simple</value>
  </property>
-- Best regards,
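The RemoteException says HttpFS has no operation named LIST; the op query parameter must name a valid WebHDFS/HttpFS operation, and the one that lists a directory is LISTSTATUS. A corrected request, keeping the same host, port, and user from the message above:

curl -i "http://host:3888/webhdfs/v1/user/myuser?user.name=myuser&op=LISTSTATUS"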
distcp in Hadoop 2.0.4 over http?
I want to copy HDFS files over HTTP using distcp, but I can't. It is a configuration problem that I can't find. How can I do distcp in Hadoop 2.0.4 over HTTP? First I set up hadoop 2.0.4 over http - Httpfs - on port 3888, which is running. Here is the proof:

$ curl -i http://zk1.host.com:3888?user.name=babu&op=homedir
[1] 32129
[myuser@zk1 hadoop]$ HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Accept-Ranges: bytes
ETag: W/"674-136580299"
Last-Modified: Fri, 12 Apr 2013 21:43:10 GMT
Content-Type: text/html
Content-Length: 674
Date: Sat, 01 Jun 2013 15:48:04 GMT
<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<b>HttpFs service</b>, service base URL at /webhdfs/v1.
</body>
</html>

But when I do distcp, I can't copy:

$ hadoop distcp http://zk1.host:3888/gutenberg/a.txt http://zk1.host:3888/
Warning: $HADOOP_HOME is deprecated.
Copy failed: java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

$ hadoop distcp httpfs://zk1.host:3888/gutenberg/a.txt httpfs://zk1.host:3888/
Copy failed: java.io.IOException: No FileSystem for scheme: httpfs
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

$ hadoop distcp hdfs://zk1.host:3888/gutenberg/a.txt hdfs://zk1.host:3888/
Copy failed: java.io.IOException: Call to zk1.host/127.0.0.1:3888 failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
at org.apache.hadoop.ipc.Client.call(Client.java:1112)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)

Here are my core-site file and httpfs-env.sh where I configured HDFS and HTTPFS:

$ cat etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://zk1.host:9000</value>
  </property>
  <property>
    <name>hadoop.proxyuser.myuser.hosts</name>
    <value>zk1.host</value>
  </property>
  <property>
    <name>hadoop.proxyuser.myuser.groups</name>
    <value>*</value>
  </property>
</configuration>

$ cat etc/hadoop/httpfs-env.sh
#!/bin/bash export HTTPFS_HTTP_PORT=3888 export HTTPFS_HTTP_HOSTNAME=`hostname -f` -- Best regards,
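distcp resolves each path through a Hadoop FileSystem implementation keyed by the URI scheme, and no FileSystem is registered for http or httpfs, which is exactly what "No FileSystem for scheme" means; pointing hdfs:// at port 3888 cannot work either, because that port speaks HTTP while the hdfs scheme expects the NameNode RPC protocol, hence the EOFException. Hadoop 2.x ships a webhdfs client that speaks the same REST protocol HttpFS exposes, so a sketch of the copy over HTTP could look like this (hostnames from the original message, destination path assumed):

hadoop distcp webhdfs://zk1.host:3888/gutenberg/a.txt hdfs://zk1.host:9000/gutenberg-copy/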
how launch mapred in hadoop 2.0.4?
In Hadoop mapreduce, is there a need to launch mapred separately (mapred start), or is launching yarn ($ sbin/yarn-daemon.sh start resourcemanager ; sbin/yarn-daemon.sh start nodemanager) the same thing? -- Best regards,
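In Hadoop 2 the ResourceManager and NodeManagers replace the JobTracker/TaskTrackers, so starting YARN is what makes MapReduce jobs runnable; the only mapred-specific daemon left is the optional Job History Server. A sketch of a typical startup, assuming the standard sbin scripts of a 2.0.x tarball:

sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver   # optional, serves finished-job history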
Queues in hadoop 2.0.4
I am using hadoop 2.0.4. 1 - Which component manages queues? Is it the jobtracker? 2 - If so, is it possible to define several queues (set mapred.job.queue.name=$QUEUE_NAME;)? -- Best regards,
copy data between hosts and using hdfs proxy.
Hi, I want to copy data between hosts in Hadoop 2.0.4, but the hosts are using HDFS Proxy on port 3888. I tried the protocols hftp, httpfs, and hdfs; none of the examples worked. hadoop distcp hftp://host1:3888/user/out/part-m-00029 hftp://host2:3888/ Any suggestion? -- Best regards,
Re: Mapreduce queues
In this article (http://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-scheduler-4141.html), it is said that The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. (...) The Scheduler then allocates resources based on application-specific constraints such as appropriate machines and global constraints such as capacities of the application, queue, user etc. Maybe this is not the right place to put this question, but I just wanted to know if mapreduce uses the term queue. If so, what is a queue for mapreduce? On 27 May 2013 09:36, Harsh J ha...@cloudera.com wrote: Can you rephrase your question to include definitions of what you mean by 'queues' and what you mean by 'clusters'? On Wed, May 22, 2013 at 7:22 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, When a cluster has several queues, the JobTracker has to manage all clusters? -- Best regards, -- Harsh J -- Best regards,
HDFS counters
I am analyzing some HDFS counters, and I have these questions: 1 - Is HDFS: Number of bytes read updated as the map tasks read data from HDFS, or is it a sum pre-calculated before the mappers start to read? 2 - According to these metrics, some data was written to HDFS before the map tasks finished. Does anyone have an opinion on whether it is possible that the map tasks write their intermediate output to HDFS? Does this happen because the job defined by the user forces it to (I don't know what this job does)? <mapcompletion>map() completion: 0.9946828</mapcompletion> <redcompletion>reduce() completion: 0.0</redcompletion> <hdfs>HDFS: Number of bytes read=314470180</hdfs> <hdfs>HDFS: Number of bytes written=313912087</hdfs> -- Best regards,
hadoop queue -list
1 - I am looking at the queue list in my system, and I have several queues defined. In one of the queues I have this info: Scheduling Info : Capacity: 1.0, MaximumCapacity: 1.0, CurrentCapacity: 77.534035 Why is the current capacity much bigger than the maximum capacity? 2 - Can I use the queue info to know whether there is space to run more jobs? -- Best regards,
Mapreduce queues
Hi, When a cluster has several queues, the JobTracker has to manage all clusters? -- Best regards,
job -list parameters
When I list all the jobs running, I get several parameters related to the job. What do the parameters marked below mean? JobId State StartTime UserName Queue Priority UsedContainers -- what is this parameter? RsvdContainers -- what is this parameter? UsedMem RsvdMem -- what is this parameter? NeededMem AM info -- Best regards,
Combine data from different HDFS FS
Hi, I want to combine data that is in different HDFS filesystems so that it is processed in one job. Is it possible to do this with MR, or is there another Apache tool that allows me to do this? E.g. HDFS data in Cluster1 and HDFS data in Cluster2 -> one job reads the data from Cluster1 and Cluster2. Thanks, -- Best regards,
Re: Combine data from different HDFS FS
I'm invoking the wordcount example on host1 with this command, but I get an error. HOST1:$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://HOST2:54310/gutenberg gutenberg-output 13/04/08 22:02:55 ERROR security.UserGroupInformation: PriviledgedActionException as:ubuntu cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HOST2:54310/gutenberg org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HOST2:54310/gutenberg Can you be more specific about using FileInputFormat? It's because I've configured MapReduce and HDFS to work on HOST, and I don't know how I can make a wordcount that reads the data from the HDFS files on HOST1 and HOST2. On 8 April 2013 19:34, Harsh J ha...@cloudera.com wrote: You should be able to add fully qualified HDFS paths from N clusters into the same job via FileInputFormat.addInputPath(…) calls. Caveats may apply for secure environments, but for non-secure mode this should work just fine. Did you try this and did it not work? On Mon, Apr 8, 2013 at 9:56 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I want to combine data that is in different HDFS filesystems so that it is processed in one job. Is it possible to do this with MR, or is there another Apache tool that allows me to do this? E.g. HDFS data in Cluster1 and HDFS data in Cluster2 -> one job reads the data from Cluster1 and Cluster2. Thanks, -- Best regards, -- Harsh J -- Best regards,
Re: Combine data from different HDFS FS
Maybe there is some FileInputFormat class that allows to define input files from different locations. What I would like to know, is if it's possible to read input data from different HDFS FS. E.g., run the wordcount with the input files from HDFS FS in HOST1 and HOST2 (the FS in HOST1 and HOST2 are distinct). Any suggestion on which InputFormat I should use? On 9 April 2013 00:10, Pedro Sá da Costa psdc1...@gmail.com wrote: I'm invoking the wordcount example in host1 with this command, but I got an error. HOST1:$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount hdfs://HOST2:54310/gutenberg gutenberg-output 13/04/08 22:02:55 ERROR security.UserGroupInformation: PriviledgedActionException as:ubuntu cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HOST2:54310/gutenberg org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HOST2:54310/gutenberg Can you be more specific about using the FileinputFormat? It's because I've configured MapReduce and HDFS to work in HOST, and I don't know how can I make an wordcount that reads the data from the HDFS from files in HOST1 and HOST2? On 8 April 2013 19:34, Harsh J ha...@cloudera.com wrote: You should be able to add fully qualified HDFS paths from N clusters into the same job via FileInputFormat.addInputPath(…) calls. Caveats may apply for secure environments, but for non-secure mode this should work just fine. Did you try this and did it not work? On Mon, Apr 8, 2013 at 9:56 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I want to combine the data that are in different HDFS filesystems, for them to be executed in one job. Is it possible to do this with MR, or there is another Apache tool that allows me to do this? Eg. Hdfs data in Cluster1 v Hdfs data in Cluster2 - this job reads the data from Cluster1, 2 Thanks, -- Best regards, -- Harsh J -- Best regards, -- Best regards,
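Harsh's suggestion translates to adding each cluster's path with its full hdfs:// URI when setting up the job. A minimal sketch using the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat (hosts, port, and paths are just the ones from the earlier messages; the input must actually exist on each cluster):

Job job = new Job(conf, "wordcount");
// one fully qualified path per cluster
FileInputFormat.addInputPath(job, new Path("hdfs://HOST1:54310/gutenberg"));
FileInputFormat.addInputPath(job, new Path("hdfs://HOST2:54310/gutenberg"));
FileOutputFormat.setOutputPath(job, new Path("hdfs://HOST1:54310/gutenberg-output"));

The InvalidInputException in the previous message likely means /gutenberg was simply not present (or not visible) on HOST2's HDFS, not that cross-cluster input is impossible.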
set the namenode public IP address in amazon EC2?
Hi, I'm trying to configure the Namenode with a public IP in Amazon EC2. The service always gets the host's private IP, not the public one. How can I set the namenode's public IP address? Here are my configuration files:

$ cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-46-XX.eu-west-1.compute.amazonaws.com:54310</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/ubuntu/MRtmp/dfs/data</value>
  </property>
  <property><name>dfs.permissions</name><value>false</value></property>
  <property><name>dfs.permissions.enabled</name><value>false</value></property>
  <property><name>dfs.datanode.data.dir.perm</name><value>777</value></property>
</configuration>

$ cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property><name>hadoop.tmp.dir</name><value>/home/ubuntu/MRtmp/dir/hadoop-${user.name}</value></property>
  <property><name>hadoop.backup.files</name><value>true</value></property>
  <property><name>hadoop.tmp.bkp.dir</name><value>/home/ubuntu/MRtmp/backup/dir/hadoop-${user.name}</value></property>
  <property><name>fs.default.name</name><value>hdfs://ec2-46-XXX.eu-west-1.compute.amazonaws.com:54310</value></property>
  <property><name>hadoop.security.authentication</name><value>simple</value></property>
  <property><name>hadoop.security.authorization</name><value>false</value></property>
</configuration>
-- Best regards,
Is it possible to set FS permissions (e.g. 755) in hdfs-site.xml?
Is it possible to set FS permissions (e.g. 755) in hdfs-site.xml? -- Best regards,
Can FSDataOutputStream write to a file on a remote host?
Can FSDataOutputStream write to a file on a remote host? -- Best regards,
FSDataOutputStream hangs in out.close()
Hi, I'm using the Hadoop 1.0.4 API to try to submit a job to a remote JobTracker. I modified the JobClient to submit the same job to different JTs. E.g., the JobClient is on my PC and it tries to submit the same Job to 2 JTs at different sites in Amazon EC2. When I'm launching the Job, in the setup phase, the JobClient is trying to submit the split file info to the remote JT. This is the method of the JobClient where I have the problem: public static void createSplitFiles(Path jobSubmitDir, Configuration conf, FileSystem fs, org.apache.hadoop.mapred.InputSplit[] splits) throws IOException { FSDataOutputStream out = createFile(fs, JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf); SplitMetaInfo[] info = writeOldSplits(splits, out, conf); out.close(); writeJobSplitMetaInfo(fs, JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir), new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION), splitVersion, info); } 1 - The FSDataOutputStream hangs on the out.close() instruction. Why does it hang? What should I do to solve this? -- Best regards,
Re: FSDataOutputStream hangs in out.close()
Hi, I'm trying to make the same client to talk to different HDFS and JT instances that are in different sites of Amazon EC2. The error that I got is: java.io.IOException: Got error for OP_READ_BLOCK, self=/XXX.XXX.XXX.123:44734, remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010, for file ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316, for block -4664365259588027316_2050 This error means than it wasn't possible to write on a remote host? On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote: You can try to take a jstack stack trace and see what its hung on. I've only ever noticed a close() hang when the NN does not accept the complete-file call (due to minimum replication not being guaranteed), but given your changes (which I haven't an idea about yet) it could be something else as well. You're essentially trying to make the same client talk to two different FSes I think (aside of the JT RPC). On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I'm using the Hadoop 1.0.4 API to try to submit a job in a remote JobTracker. I created modfied the JobClient to submit the same job in different JTs. E.g, the JobClient is in my PC and it try to submit the same Job in 2 JTs at different sites in Amazon EC2. When I'm launching the Job, in the setup phase, the JobClient is trying to submit split file info into the remote JT. This is the method of the JobClient that I've the problem: public static void createSplitFiles(Path jobSubmitDir, Configuration conf, FileSystem fs, org.apache.hadoop.mapred.InputSplit[] splits) throws IOException { FSDataOutputStream out = createFile(fs, JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf); SplitMetaInfo[] info = writeOldSplits(splits, out, conf); out.close(); writeJobSplitMetaInfo(fs,JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir), new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION), splitVersion, info); } 1 - The FSDataOutputStream hangs in the out.close() instruction. Why it hangs? What should I do to solve this? -- Best regards, -- Harsh J -- Best regards,
Re: FSDataOutputStream hangs in out.close()
I can add this information taken from the datanode logs, but it seems something related to blocks: nfoPort=50075, ipcPort=50020):Got exception while serving blk_-4664365259588027316_2050 to /XXX.XXX.XXX.123: java.io.IOException: Block blk_-4664365259588027316_2050 is not valid. at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045) at org.apache.hadoop.hdfs.server.datanode.BlockSender.init(BlockSender.java:94) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) at java.lang.Thread.run(Thread.java:662) 2013-03-27 15:44:54,965 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(XXX.XXX.XXX.123:50010, storageID=DS-595468034-XXX.XXX.XXX.123-50010-1364122596021, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Block blk_-4664365259588027316_2050 is not valid. at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045) at org.apache.hadoop.hdfs.server.datanode.BlockSender.init(BlockSender.java:94) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) at java.lang.Thread.run(Thread.java:662) I still have no idea why this error, if the 2 HDFS instances have the same data. On 27 March 2013 15:53, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I'm trying to make the same client to talk to different HDFS and JT instances that are in different sites of Amazon EC2. The error that I got is: java.io.IOException: Got error for OP_READ_BLOCK, self=/XXX.XXX.XXX.123:44734, remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010, for file ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316, for block -4664365259588027316_2050 This error means than it wasn't possible to write on a remote host? On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote: You can try to take a jstack stack trace and see what its hung on. I've only ever noticed a close() hang when the NN does not accept the complete-file call (due to minimum replication not being guaranteed), but given your changes (which I haven't an idea about yet) it could be something else as well. You're essentially trying to make the same client talk to two different FSes I think (aside of the JT RPC). On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I'm using the Hadoop 1.0.4 API to try to submit a job in a remote JobTracker. I created modfied the JobClient to submit the same job in different JTs. E.g, the JobClient is in my PC and it try to submit the same Job in 2 JTs at different sites in Amazon EC2. When I'm launching the Job, in the setup phase, the JobClient is trying to submit split file info into the remote JT. 
This is the method of the JobClient that I've the problem: public static void createSplitFiles(Path jobSubmitDir, Configuration conf, FileSystem fs, org.apache.hadoop.mapred.InputSplit[] splits) throws IOException { FSDataOutputStream out = createFile(fs, JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf); SplitMetaInfo[] info = writeOldSplits(splits, out, conf); out.close(); writeJobSplitMetaInfo(fs,JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir), new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION), splitVersion, info); } 1 - The FSDataOutputStream hangs in the out.close() instruction. Why it hangs? What should I do to solve this? -- Best regards, -- Harsh J -- Best regards, -- Best regards,
Re: FSDataOutputStream hangs in out.close()
I just create 2 different FS instances. On Wednesday, 27 March 2013, Harsh J wrote: Same data does not mean same block IDs across two clusters. I'm guessing this is cause of some issue in your code when wanting to write to two different HDFS instances with the same client. Did you do a low level mod for HDFS writes as well or just create two different FS instances when you want to write to different ones? On Wed, Mar 27, 2013 at 9:34 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: I can add this information taken from the datanode logs, but it seems something related to blocks: nfoPort=50075, ipcPort=50020):Got exception while serving blk_-4664365259588027316_2050 to /XXX.XXX.XXX.123: java.io.IOException: Block blk_-4664365259588027316_2050 is not valid. at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045) at org.apache.hadoop.hdfs.server.datanode.BlockSender.init(BlockSender.java:94) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) at java.lang.Thread.run(Thread.java:662) 2013-03-27 15:44:54,965 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(XXX.XXX.XXX.123:50010, storageID=DS-595468034-XXX.XXX.XXX.123-50010-1364122596021, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Block blk_-4664365259588027316_2050 is not valid. at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035) at org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045) at org.apache.hadoop.hdfs.server.datanode.BlockSender.init(BlockSender.java:94) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99) at java.lang.Thread.run(Thread.java:662) I still have no idea why this error, if the 2 HDFS instances have the same data. On 27 March 2013 15:53, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I'm trying to make the same client to talk to different HDFS and JT instances that are in different sites of Amazon EC2. The error that I got is: java.io.IOException: Got error for OP_READ_BLOCK, self=/XXX.XXX.XXX.123:44734, remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010, for file ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316, for block -4664365259588027316_2050 This error means than it wasn't possible to write on a remote host? On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote: You can try to take a jstack stack trace and see what its hung on. I've only ever noticed a close() hang when the NN does not accept the complete-file call (due to minimum replication not being guaranteed), but given your changes (which I haven't an idea about yet) it could be something else as well. You're essentially trying to make the same client talk to two different FSes I think (aside of the JT RPC). On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, -- Harsh J -- Best regards,
Re: is it possible to disable security in MapReduce to avoid having PriviledgedActionException?
This is my error (stacktrace below). It cannot find org.apache.hadoop.security.KerberosName class. But the strange is that I have hadoop-core-1.0.4-SNAPSHOT.jar in the classpath, and the path to the jar is correct. I've no idea what the problem is. Any help? java.io.IOException: failure to login at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490) at org.apache.hadoop.mapred.manager.DeferredScheduler$1.run(DeferredScheduler.java:80) Caused by: javax.security.auth.login.LoginException: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.KerberosName at org.apache.hadoop.security.User.init(User.java:44) at org.apache.hadoop.security.User.init(User.java:39) at org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule.commit(UserGroupInformation.java:130) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at javax.security.auth.login.LoginContext.invoke(LoginContext.java:769) at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186) at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703) at javax.security.auth.login.LoginContext.login(LoginContext.java:576) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471) On 25 March 2013 02:11, Harsh J ha...@cloudera.com wrote: What is the exact error you're getting? Can you please paste with the full stack trace and your version in use? Many times the PriviledgedActionException is just a wrapper around the real cause and gets overlooked. It does not necessarily appear due to security code (whether security is enabled or disabled). In any case, if you meant to run MR with zero UGI.doAs (which will wrap with that exception) then no, thats not possible to do. On Mon, Mar 25, 2013 at 12:57 AM, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, is it possible to disable security in MapReduce to avoid having PriviledgedActionException? Thanks, -- Harsh J -- Best regards,
who runs the map and reduce tasks in the unit tests
Hi, In the Hadoop MR unit tests, the classes use ./core/org/apache/hadoop/util/Tool.java and ./core/org/apache/hadoop/util/ToolRunner.java to submit the job. But to run the unit tests it seems that MR does not need to be running. If so, who runs the map and reduce tasks? -- Best regards, P
configure mapreduce to work with pem files.
I'm trying to configure ssh for Hadoop mapreduce, but my nodes only communicate with each other using RSA keys in pem format. (It doesn't work) ssh user@host Permission denied (publickey). (It works) ssh -i ~/key.pem user@host The nodes in mapreduce communicate using ssh. How do I configure ssh, or mapreduce, to work with the pem file? -- Best regards, P
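As the reply in the next message points out, Hadoop itself does not use SSH between nodes; SSH is only needed by the start-all/stop-all helper scripts that log in to each slave. If those scripts are wanted, the pem key can be wired in through the ssh client configuration rather than through Hadoop. A sketch of a ~/.ssh/config entry, assuming EC2 hostnames and the ubuntu user:

Host *.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/key.pem

With that in place, plain ssh user@host (and therefore the Hadoop slave scripts) picks up the key automatically.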
Re: configure mapreduce to work with pem files.
So why is it necessary to configure ssh in Hadoop MR? On 13 February 2013 12:58, Harsh J ha...@cloudera.com wrote: Hi, Nodes in Hadoop do not communicate using SSH. See http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F On Wed, Feb 13, 2013 at 5:16 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: I'm trying to configure ssh for Hadoop mapreduce, but my nodes only communicate with each other using RSA keys in pem format. (It doesn't work) ssh user@host Permission denied (publickey). (It works) ssh -i ~/key.pem user@host The nodes in mapreduce communicate using ssh. How do I configure ssh, or mapreduce, to work with the pem file? -- Best regards, P -- Harsh J -- Best regards,
Re: Save configuration data in job configuration file.
This does not save it in the xml file. I think this just keeps the variable in memory. On 19 January 2013 18:48, Arun C Murthy a...@hortonworks.com wrote: jobConf.set(String, String)? -- Best regards,
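jobConf.set() only changes the in-memory Configuration; it gets persisted at submission time when the framework writes the job's job.xml, not back into the files under conf/. To write the modified configuration to an XML file explicitly, Configuration.writeXml can be used. A small sketch with an assumed output path:

// after jobConf.set("my.custom.key", "value");
java.io.OutputStream out = new java.io.FileOutputStream("/tmp/job-conf-snapshot.xml");
jobConf.writeXml(out);   // serializes every property, including the ones just set
out.close();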
Save configuration data in job configuration file.
Hi, I want to save some configuration data in the configuration files that belong to the job. How can I do it? -- Best regards,
Re: When reduce tasks start in MapReduce Streaming?
So why is it called hadoop streaming, if it doesn't behave like a streaming application (the reduces don't receive data as it is produced by the map tasks)? On 16 January 2013 05:41, Jeff Bean jwfb...@cloudera.com wrote: me property. The reduce method is not called until the mappers are done, and the reducers are not scheduled before the threshold set by mapred.reduce.slowstart.completed.maps is reached. -- Best regards,
When reduce tasks start in MapReduce Streaming?
Hi, I read in the documents that in MapReduce, the reduce tasks only start after a percentage (by default 90%) of the maps end. This means that the slowest maps can delay the start of the reduce tasks, and that the input data consumed by the reduce tasks is presented as a batch. This means that the scenario of reduce tasks consuming data as the map tasks produce it doesn't exist. But does this still happen with Hadoop MapReduce streaming? -- Best regards, P
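The threshold mentioned in the reply above (mapred.reduce.slowstart.completed.maps) is an ordinary per-job setting, so it applies whether or not the streaming jar is used: reducers can be scheduled (and start shuffling) earlier, but their reduce() still only runs once all map output is available. A sketch of lowering it, assuming the Hadoop 1.x property name and a driver that honours generic -D options via ToolRunner:

# schedule reducers once 50% of the maps have completed
hadoop jar my-job.jar MyDriver -D mapred.reduce.slowstart.completed.maps=0.50 input output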
Map tasks allocation in reduce slots?
The MapReduce framework has map and reduce slots that are used to track which tasks are running. When only map tasks are running, will the reduce slots that the job has be filled by map tasks? -- Best regards,
Profiler in Hadoop MapReduce
Hi, I want to attach jprofiler to Hadoop MapReduce (MR). Do I need to configure MR to open ports for the Jobtracker, tasktracker, and the map and reduce tasks so that I can attach jprofiler? -- Best regards,
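The daemons and the task JVMs are started with the java options given in the configuration, so a profiler agent is usually attached by adding its flag there rather than by opening ports by hand. A sketch for the task JVMs, assuming the Hadoop 1.x property name and a purely illustrative JProfiler agent path and port (check the JProfiler docs for the exact agent flag):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -agentpath:/opt/jprofiler/bin/linux-x64/libjprofilerti.so=port=8849</value>
</property>

For the JobTracker and TaskTracker themselves, the analogous place would be HADOOP_OPTS (or the daemon-specific *_OPTS variables) in hadoop-env.sh. Hadoop also has a built-in hprof hook via mapred.task.profile if only a small sample of tasks needs profiling.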
Map output files and partitions.
Hi, There are only 2 types of map output files, Sequence and Text files. If those files are going to be used as input to several reduce tasks, they need to be partitioned into blocks. Are there any SEPARATOR bits that delimit each partition? Can I read a specific partition of a map output file? Is there an API for that? -- Best regards,
Get job, map and reduce times with RunningJob API.
Hi, I want to know when the job and map and reduce tasks started and ended in a job using the RunningJob API. How can I get this information? Thanks, -- Best regards,
Re: Get job, map and reduce times with RunningJob API.
For that I think I must access the JobTracker to get the TaskReports. But how can I access the JobTracker server from a Java class? In a JSP you just need the instruction final JobTracker tracker = (JobTracker) application.getAttribute("job.tracker");, but what do I need in plain Java? On 29 November 2012 11:12, Pedro Sá da Costa psdc1...@gmail.com wrote: Hi, I want to know when the job and the map and reduce tasks started and ended in a job using the RunningJob API. How can I get this information? Thanks, -- Best regards, -- Best regards,
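Outside the JobTracker's own web application, the usual entry point is not the JobTracker object itself but the JobClient RPC client, which already exposes the task reports. A minimal sketch, assuming mapred.job.tracker is set in the configuration on the classpath and using a placeholder job id:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskReport;

JobConf conf = new JobConf();                        // picks up mapred-site.xml
JobClient client = new JobClient(conf);
JobID id = JobID.forName("job_201211281900_0001");   // placeholder job id
RunningJob job = client.getJob(id);
TaskReport[] maps = client.getMapTaskReports(id);
TaskReport[] reduces = client.getReduceTaskReports(id);
for (TaskReport r : maps) {
    System.out.println(r.getTaskID() + " start=" + r.getStartTime() + " finish=" + r.getFinishTime());
}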
Job progress in bash
Hadoop Mapreduce has an web interface that shows the progress of running jobs. Can I get the same information about the job progress in bash? There's a program to print the progress in the terminal? Thanks, -- Best regards,
Re: Job progress in bash
Yes I can, but I want more details about the tasks, like the time they started and ended, and the duration of the shuffle. I want as much information as the hadoop job -history all command gives, but while the job progresses. On 28 November 2012 11:32, Harsh J ha...@cloudera.com wrote: hadoop job -status -- Best regards,
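For watching a job from the terminal, the CLI pieces can simply be combined in a small polling loop. A sketch with a placeholder job id:

# poll the status every 30 seconds; Ctrl-C to stop
JOB=job_201211281900_0001   # placeholder job id
while true; do
  hadoop job -status $JOB
  sleep 30
done

Detailed per-task timings (start, finish, shuffle) only become available through hadoop job -history <output-dir> once the job has written its history, so a running job is limited to what -status and the counters expose.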
Get JobInProgress given jobId
I'm building a Java class and given a JobID, how can I get the JobInProgress? Can anyone give me an example? -- Best regards,
Re: Get JobInProgress given jobId
I have the jobId as a String, and from that I want to access the RunningJob API for that jobId. I think that it is only possible to access this API through the JobInProgress class, but maybe I'm wrong. Is this true? On 28 November 2012 17:24, Mahesh Balija balijamahesh@gmail.com wrote: Hi Pedro, You can get the JobInProgress instance from JobTracker. JobInProgress getJob(JobID jobid); Best, Mahesh Balija, Calsoft Labs. On Wed, Nov 28, 2012 at 10:41 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: I'm building a Java class and given a JobID, how can I get the JobInProgress? Can anyone give me an example? -- Best regards, -- Best regards,
Re: Get JobInProgress given jobId
On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote: Is your client application's hadoop jar the same version as the server? Yes it is. 2. Is the port 54311 the proper JobTracker port? This jobtracker port is set to:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <!--<value>local</value>-->
  <description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description>
</property>
-- Best regards,
Re: Get JobInProgress given jobId
I have this error in the jobtracker log. Maybe this is the reason. What does this error mean? 2012-11-28 19:19:17,697 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 127.0.0.1:60089 got version 4 expected version 3 On 28 November 2012 18:28, Pedro Sá da Costa psdc1...@gmail.com wrote: On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote: Is your client application's hadoop jar the same version as the server? Yes it is. 2. Is the port 54311 the proper JobTracker port? This jobtracker port is set to:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <!--<value>local</value>-->
  <description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description>
</property>
-- Best regards, -- Best regards,
Re: Get JobInProgress given jobId
And regarding this error: after all, maybe I'm running hadoop jars with different versions. I'm running hadoop-0.20 and trying to run the JobClient with hadoop-1.0. On 28 November 2012 19:20, Pedro Sá da Costa psdc1...@gmail.com wrote: I have this error in the jobtracker log. Maybe this is the reason. What does this error mean? 2012-11-28 19:19:17,697 WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 127.0.0.1:60089 got version 4 expected version 3 On 28 November 2012 18:28, Pedro Sá da Costa psdc1...@gmail.com wrote: On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote: Is your client application's hadoop jar the same version as the server? Yes it is. 2. Is the port 54311 the proper JobTracker port? This jobtracker port is set to:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <!--<value>local</value>-->
  <description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description>
</property>
-- Best regards, -- Best regards, -- Best regards,
Shoud I use MapReduce 0.2X, or 1.0?
I've noticed that Hadoop MapReduce 1.0.4 was released on 12 October 2012, and Hadoop 0.23.4 was released on 15 October 2012. I thought that with Hadoop 1.0 the Hadoop 0.2X line had been discontinued. If I want to start to use Hadoop MapReduce, which version should I use? What's the difference between Hadoop MapReduce 0.2X and Hadoop MapReduce 1.0? -- Best regards,
Cannot run program autoreconf
I'm trying to compile mapreduce, but I get the error: create-native-configure: BUILD FAILED /home/xeon/Projects/hadoop-1.0.3/build.xml:618: Execute failed: java.io.IOException: Cannot run program autoreconf (in directory /home/xeon/Projects/hadoop-1.0.3/src/native): java.io.IOException: error=2, No such file or directory What does this error mean? -- Best regards,
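error=2, No such file or directory for "Cannot run program autoreconf" means the autoreconf binary is not installed (or not on the PATH); the create-native-configure target shells out to the GNU autotools to prepare the native code build. A sketch of installing them, assuming a Debian/Ubuntu system (package names differ on other distributions):

sudo apt-get install autoconf automake libtool
# then retry the build, e.g.:
ant create-native-configure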
SecretKey in MapReduce
Hi, - Hadoop 1.0.2 uses a SecretKey, but I don't understand its purpose. Can anyone explain what the SecretKey is for? - Is this secret key shared between the JobTracker, the TaskTrackers, and the map and reduce tasks? -- Best regards,
splits and maps
If I have an input file of 640MB in size and a split size of 64MB, this file will be partitioned into 10 splits, and each split will be processed by a map task, right? -- Best regards,
How map tasks know which is in the input file?
Hi, 1 - In the JobTracker in Hadoop Mapreduce 1.0.3, there's a new JobToken. What's the purpose of the JobToken? 2 - I also noticed that the input files now have some metafiles. It seems that the way tasks get the input files for the map tasks is completely different from what hadoop 0.20.0 does. With the new version the input file name isn't given directly to the map tasks. Can someone give me an insight into how the map tasks know the file name and path of their input split? -- Best regards,
submit a job in a remote jobtracker
I want to submit a job to a remote job tracker. How can I do it? -- Best regards,
Re: submit a job in a remote jobtracker
But this solution implies that a user must access the remote machine before submitting the job. This is not what I want. I want to submit the job from my local machine and have it forwarded to the remote JobTracker. On 14 August 2012 14:15, Harsh J ha...@cloudera.com wrote: Hi Pedro, This has been asked before. See http://search-hadoop.com/m/bikPd1LWhhB1 (or search more on that same site) On Tue, Aug 14, 2012 at 6:32 PM, Pedro Sá da Costa psdc1...@gmail.com wrote: I want to submit a job to a remote job tracker. How can I do it? -- Best regards, -- Harsh J -- Best regards,
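Submitting from the local machine without logging into the remote host is essentially a matter of pointing the client configuration at the remote NameNode and JobTracker before submitting. A minimal sketch for Hadoop 1.x, with placeholder hostnames and the default ports used elsewhere in these threads (the remote ports must be reachable from the client, and the client jar versions must match the cluster):

Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://remotehost:54310");   // remote NameNode
conf.set("mapred.job.tracker", "remotehost:54311");       // remote JobTracker
JobConf job = new JobConf(conf, MyJob.class);
// ... set mapper/reducer/input/output paths as usual ...
JobClient.runJob(job);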
Can't run Hadoop MR 1.0.3 Junit tests in IDE.
Hi, I'm trying to run Hadoop Junit tests in IDE, but I got errors. I've the mapreduce running properly. I'm using the version Hadoop 1.0.3: * * 2012-08-10 13:53:05,803 ERROR mapred.MiniMRCluster (MiniMRCluster.java:run(119)) - Job tracker crashed java.lang.NullPointerException at java.io.File.init(File.java:239) at org.apache.hadoop.mapred.JobHistory.initLogDir(JobHistory.java:531) at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:499) at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2330) at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2327) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:2327) at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:2188) at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:2182) at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:296) at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner$1.run(MiniMRCluster.java:114) at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner$1.run(MiniMRCluster.java:1) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:112) at java.lang.Thread.run(Thread.java:679) 2012-08-10 13:53:05,895 INFO mapred.MiniMRCluster (MiniMRCluster.java:init(188)) - mapred.local.dir is /home/xeon/workspace/hadoop-1.0.3-tests/build/test/mapred/local/0_0 2012-08-10 13:53:10,923 INFO http.HttpServer (HttpServer.java:addGlobalFilter(411)) - Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter) 2012-08-10 13:53:10,941 INFO mapred.TaskLogsTruncater (TaskLogsTruncater.java:init(72)) - Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1 2012-08-10 13:53:10,947 INFO mapred.TaskTracker (TaskTracker.java:initialize(694)) - Starting tasktracker with owner as xeon 2012-08-10 13:53:10,948 INFO mapred.TaskTracker (TaskTracker.java:initialize(710)) - Good mapred local directories are: /home/xeon/workspace/hadoop-1.0.3-tests/build/test/mapred/local/0_0 2012-08-10 13:53:10,959 WARN util.NativeCodeLoader (NativeCodeLoader.java:clinit(52)) - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 2012-08-10 13:53:10,977 INFO ipc.Server (Server.java:run(328)) - Starting SocketReader 2012-08-10 13:53:10,979 INFO ipc.Server (Server.java:run(598)) - IPC Server Responder: starting 2012-08-10 13:53:10,979 INFO ipc.Server (Server.java:run(434)) - IPC Server listener on 58393: starting 2012-08-10 13:53:10,983 INFO ipc.Server (Server.java:run(1358)) - IPC Server handler 0 on 58393: starting 2012-08-10 13:53:10,984 INFO ipc.Server (Server.java:run(1358)) - IPC Server handler 1 on 58393: starting 2012-08-10 13:53:10,984 INFO ipc.Server (Server.java:run(1358)) - IPC Server handler 2 on 58393: starting 2012-08-10 13:53:10,987 INFO ipc.Server (Server.java:run(1358)) - IPC Server handler 3 on 58393: starting 2012-08-10 13:53:10,987 INFO mapred.TaskTracker (TaskTracker.java:initialize(794)) - TaskTracker up at: localhost.localdomain/127.0.0.1:58393 2012-08-10 13:53:10,988 INFO mapred.TaskTracker (TaskTracker.java:initialize(797)) - Starting tracker tracker_host0.foo.com: localhost.localdomain/127.0.0.1:58393 2012-08-10 13:53:12,050 INFO ipc.Client (Client.java:handleConnectionFailure(666)) - Retrying connect to server: localhost/127.0.0.1:0. Already tried 0 time(s). 2012-08-10 13:53:13,051 INFO ipc.Client (Client.java:handleConnectionFailure(666)) - Retrying connect to server: localhost/127.0.0.1:0. Already tried 1 time(s). 2012-08-10 13:53:14,053 INFO ipc.Client (Client.java:handleConnectionFailure(666)) - Retrying connect to server: localhost/127.0.0.1:0. Already tried 2 time(s). How do I run Hadoop Junit tests properly? Thanks, * * * * -- Best regards,