There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

2013-08-19 Thread Pedro da Costa
I am trying to copy some data with distcp and I get the error: "There are 2
datanode(s) running and 2 node(s) are excluded in this operation." I did
not exclude any node, I have lots of space, and HDFS is not in safemode.


The command that I use is:


/home/ubuntu/Programs/hadoop/bin/hadoop distcp hdfs://host1:9000/wiki
hdfs://host2:9000/wiki



Here is the hdfs-site.xml of host1 and host2:

<configuration>
  <property> <name>dfs.replication</name> <value>2</value> </property>
  <property> <name>dfs.permissions</name> <value>false</value> </property>
  <property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
  <property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>
</configuration>


What is wrong?

-- 
Best regards,


How run Aggregator wordcount?

2013-06-22 Thread Pedro da Costa
Does the aggregator wordcount accept multiple folders as input? E.g.: bin/hadoop jar
hadoop-*-examples.jar aggregatewordcount inputfolder1 inputfolder2
inputfolder3 outfolder1


-- 
Best regards,


launch aggregatewordcount and sudoku in Yarn

2013-06-21 Thread Pedro da Costa
How do I run aggregatewordcount and sudoku in Yarn? Do I need any input
files, more exactly for sudoku?

-- 
Best regards,


I just want the last 4 jobs in the job history in Yarn?

2013-06-18 Thread Pedro da Costa
Is it possible to specify that I just want the last 4 jobs in the job history
in Yarn?

-- 
Best regards,


Re: mapred queue -list

2013-06-14 Thread Pedro da Costa
What does it mean that max-capacity can be configured to be greater than
capacity? If max-capacity is greater than capacity, isn't there an overload
of the queue?



On 14 June 2013 22:16, Arun C Murthy a...@hortonworks.com wrote:

 Capacity is 'guaranteed' capacity, while max-capacity can be configured to
 be greater than capacity.

 Arun

 On Jun 13, 2013, at 5:28 AM, Pedro Sá da Costa wrote:


 When I launch the command mapred queue -list I have this output:


 Scheduling Info : Capacity: 100.0, MaximumCapacity: 1.0, CurrentCapacity:
 0.0

 What is the difference between Capacity and  MaximumCapacity fields?



 --
 Best regards,


  --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/





-- 
Best regards,


Re: Get the history info in Yarn

2013-06-13 Thread Pedro da Costa
And how can I get these values using the job ID in Java?



On 13 June 2013 08:15, Devaraj k devara...@huawei.com wrote:

  As per my understanding, as of now the start and end times are not available
 through a shell command. You can use the JobClient API to get the same.

 Thanks & Regards,

 Devaraj

 *From:* Pedro Sá da Costa [mailto:psdc1...@gmail.com]
 *Sent:* 13 June 2013 11:37
 *To:* mapreduce-user; Devaraj k
 *Subject:* Re: Get the history info in Yarn


 But this command doesn't tell me the job duration, or job start time and
 end time. How can I get this info?


 On 13 June 2013 07:41, Devaraj K devara...@huawei.com wrote:

 Hi,

  

 You can get all the details for Job using this mapred command

  

 mapred job -status Job-ID

  

 For this you need to have the Job History Server running and the same job
 history server address configured on the client side.

  

  

 Thanks & Regards

 Devaraj K

  

 *From:* Pedro Sá da Costa [mailto:psdc1...@gmail.com]
 *Sent:* Thursday, June 13, 2013 10:52 AM
 *To:* mapreduce-user
 *Subject:* Get the history info in Yarn

  

 I tried the command mapred job -list all to get the history of the jobs
 completed, but the output doesn't have the time when a job started and ended,
 the number of maps and reduces, or the size of data read and written. Can I
 get this info with a shell command?

 I am using Yarn.
 


 --
 Best regards,




 --
 Best regards,




-- 
Best regards,


mapred queue -list

2013-06-13 Thread Pedro da Costa
When I launch the command mapred queue -list I have this output:


Scheduling Info : Capacity: 100.0, MaximumCapacity: 1.0, CurrentCapacity:
0.0

What is the difference between Capacity and  MaximumCapacity fields?



-- 
Best regards,


HDFS metrics

2013-06-12 Thread Pedro da Costa
I am using Yarn, and

1 - I want to know the average I/O throughput of HDFS (e.g. how fast the
datanodes are writing to disk) so that I can compare between 2 HDFS
instances. The command hdfs dfsadmin -report doesn't give me that. Does
HDFS have a command for that?

2 - And is there a similar way to know how fast the data is being
transferred between maps and reduces?
-- 
Best regards,


Get the history info in Yarn

2013-06-12 Thread Pedro da Costa
I tried the command mapred job -list all to get the history of the jobs
completed, but the output doesn't have the time when a job started and ended,
the number of maps and reduces, or the size of data read and written. Can I
get this info with a shell command?

I am using Yarn.

-- 
Best regards,


delete the job history saved in the Job History Server in Yarn

2013-06-11 Thread Pedro da Costa
I want to delete the job history saved in the Job History Server in Yarn.
How do I do that?

-- 
Best regards,


How can I sort a file with pairs Key Value in reverse order?

2013-06-11 Thread Pedro da Costa
I created a MapReduce job example that uses the sort mechanism of
Hadoop to sort a file by the key in ascending order. This is an example of
the data:

7   vim
2   emacs
9   firefox

At the end, I get the result:

2   emacs
7   vim
9   firefox



Now I want to sort in reverse order, so that the result is:
9   firefox
7   vim
2   emacs



How can I sort a file with pairs Key Value in reverse order?

-- 
Best regards,


Re: How can I sort a file with pairs Key Value in reverse order?

2013-06-11 Thread Pedro da Costa
Even with your answer I can't see how I can sort the data in reverse
order. I forgot to mention that the output result is produced by one
reduce task. This means that, at any point of the execution of the job,
the data must be grouped and sorted in descending order.


On 11 June 2013 13:57, Bhasker Allene allene.bhas...@gmail.com wrote:

  One way to approach this is to emit Integer.MAX_VALUE - your key as the output
 of the mapper.
 Example
 Mapper input
 7 vim
 2 emacs
 9 firefox

 Mapper output
 (Integer.MAX_VALUE - 7) vim
 (Integer.MAX_VALUE - 2) emacs
 (Integer.MAX_VALUE - 9) firefox

 If you need secondary sorting on the second part, you have to use a composite
 key and write your own partitioner and comparator.

 Regards,
 Bhasker

 On 11/06/2013 11:10, Pedro Sá da Costa wrote:

  I created a MapReduce job example that uses the sort mechanism of
 Hadoop to sort a file by the key in ascending order. This is an example of
 the data:

  7   vim
  2   emacs
  9   firefox

  At the end, I get the result:

 2   emacs
 7   vim
 9   firefox



 Now I want to sort in reverse order, so that the result is:
 9   firefox
 7   vim
 2   emacs



 How can I sort a file with pairs Key Value in reverse order?

 --
 Best regards,



 --
 Thanks  Regards,
 Bhasker Allene




-- 
Best regards,


Re: How can I sort a file with pairs Key Value in reverse order?

2013-06-11 Thread Pedro da Costa
Thanks for your help. Now I get it.


On 11 June 2013 14:21, Bhasker Allene allene.bhas...@gmail.com wrote:

  Mapper Input

 7 vim
 2 emacs
 9 firefox

 Mapper output ( new key = Integer.MAX_VALUE - key value)
 2147483640 vim
 2147483645 emacs
 2147483638 firefox

 Note: Integer.MAX_VALUE is 2147483647 (which is 2^31 - 1)

 Hadoop will sort the records for you.
 If you are using single reducer, reducer input would be

  2147483638 firefox
 2147483640 vim
 2147483645 emacs

 reducer output ( this time subtract key from Integer.MAX_VALUE to get
 back original value)
 9 firefox
 7 vim
 2 emacs
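
For reference, a minimal sketch of the approach described above, assuming the
old mapred API used elsewhere in these threads and integer keys in
tab-separated lines (the class names are illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReverseSortSketch {
    public static class InvertKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, Text> output,
                        Reporter reporter) throws IOException {
            String[] parts = line.toString().split("\t");
            int key = Integer.parseInt(parts[0]);
            // Emit the complemented key so that the ascending sort becomes descending.
            output.collect(new IntWritable(Integer.MAX_VALUE - key), new Text(parts[1]));
        }
    }

    public static class RestoreKeyReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable invertedKey, Iterator<Text> values,
                           OutputCollector<IntWritable, Text> output,
                           Reporter reporter) throws IOException {
            // Undo the complement to recover the original key.
            IntWritable original = new IntWritable(Integer.MAX_VALUE - invertedKey.get());
            while (values.hasNext()) {
                output.collect(original, values.next());
            }
        }
    }
}

An alternative, not shown here, is to keep the original keys and plug a
descending IntWritable comparator into the job with
JobConf.setOutputKeyComparatorClass.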


  On 11/06/2013 13:05, Pedro Sá da Costa wrote:

  Even with your answer I can't see how I can sort the data in reverse
  order. I forgot to mention that the output result is produced by one
  reduce task. This means that, at any point of the execution of the job,
  the data must be grouped and sorted in descending order.


 On 11 June 2013 13:57, Bhasker Allene allene.bhas...@gmail.com wrote:

   One way to approach this is to emit Integer.MAX_VALUE - your key as the output
  of the mapper.
 Example
 Mapper input
  7 vim
 2 emacs
 9 firefox

  Mapper output
 (Integer.MAX_VALUE - 7) vim
 (Integer.MAX_VALUE - 2) emacs
 (Integer.MAX_VALUE - 9) firefox

  If you need secondary sorting on the second part, you have to use a composite
  key and write your own partitioner and comparator.

 Regards,
 Bhasker

 On 11/06/2013 11:10, Pedro Sá da Costa wrote:

   I created a MapReduce job example that uses the sort mechanism of
  Hadoop to sort a file by the key in ascending order. This is an example of
  the data:

   7   vim
   2   emacs
   9   firefox

   At the end, I get the result:

  2   emacs
  7   vim
  9   firefox



  Now I want to sort in reverse order, so that the result is:
  9   firefox
  7   vim
  2   emacs



 How can I sort a file with pairs Key Value in reverse order?

 --
 Best regards,



   --
 Thanks  Regards,
 Bhasker Allene




  --
 Best regards,


 --
 Thanks  Regards,
 Bhasker Allene




-- 
Best regards,


replace separator in output.collect?

2013-06-11 Thread Pedro da Costa
output.collect(key, value) writes the key and the value separated by
"\t". Is there a way to replace it with ':'?
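
For what it's worth, with the old-API TextOutputFormat (which is what
output.collect suggests) the key/value separator comes from a configuration
property. A minimal sketch, assuming Hadoop 1.x property names (the newer
mapreduce API reads mapreduce.output.textoutputformat.separator instead):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
// TextOutputFormat falls back to "\t" when this property is not set.
conf.set("mapred.textoutputformat.separator", ":");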

-- 
Best regards,


split big files into small ones to later copy

2013-06-07 Thread Pedro da Costa
I have one 500 GB plain-text file in HDFS, and I want to copy it locally, zip
it, and put it on another machine's local disk. The problem is that I don't
have enough space on the local disk where HDFS is to zip it and then transfer
it to another host.

Can I split the file into smaller files so that I can copy them to the local disk?
Any suggestions on how to do the copy?

-- 
Best regards,


Count lines example

2013-06-05 Thread Pedro da Costa
I am trying to create a MapReduce example that adds the values of the same
keys. E.g., the input
A   1
A   2
B   4

gives the output
A   3
B   4

The problem is that I cannot make the program read 2 inputs. How do I do that?

Here is my example:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This is an example Hadoop Map/Reduce application.
 * It takes in several outputs of the count-lines job and sums them together
 * according to the line.
 *
 * To run: bin/hadoop jar build/countlinesaggregator.jar
 *            [-m <i>maps</i>] [-r <i>reduces</i>] <i>in-dirs</i> <i>out-dir</i>
 * e.g.
 *   bin/hadoop jar countlinesaggregator.jar /gutenberg-output1
 * /gutenberg-output2 /final-output
 */
public class CountLinesAggregator extends Configured implements Tool {
    /**
     * Aggregate keys and values.
     * For each line of input, break the line into words and emit them as
     * (<b>lines</b>, <b>val</b>).
     */
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, "\n");
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (token.length() > 0) {
                    System.out.println("Token: " + token);
                    String[] splits = token.split("\t");
                    if (splits[0] != null && splits[1] != null &&
                        splits[0].length() > 0 && splits[1].length() > 0) {
                        System.out.println(Arrays.deepToString(splits));
                        String k = splits[0];
                        String v = splits[1];
                        word.set(k);
                        IntWritable val = new IntWritable(Integer.valueOf(v));
                        output.collect(word, val);
                    }
                }
            }
        }
    }

    /**
     * A reducer class that just emits the sum of the input values.
     */
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    static int printUsage() {
        System.out.println("countlinesaggregator [-m maps] [-r reduces] " +
            "input1 input2 output");
        ToolRunner.printGenericCommandUsage(System.out);
        return -1;
    }

    /**
     * The main driver for the word count map/reduce program.
     * Invoke this method to submit the map/reduce job.
     * @throws IOException When there are communication problems with the
     * job tracker.
     */
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), CountLinesAggregator.class);
        conf.setJobName("countlinesaggregator");

        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setNumReduceTasks(1);

        List<String> other_args = new ArrayList<String>();
        for (int i = 0; i < args.length; ++i) {
            try {
                if ("-m".equals(args[i])) {
                    conf.setNumMapTasks(Integer.parseInt(args[++i]));
                } else if ("-r".equals(args[i])) {
                    conf.setNumReduceTasks(Integer.parseInt(args[++i]));
                } else {
                    other_args.add(args[i]);
                }
} catch (NumberFormatException 

Re: Count lines example

2013-06-05 Thread Pedro da Costa
I made a mistake in my example.

Given 2 files with the same content:
file 1  | file 2
A   3  | A  3
B   4  | B  4

gives the output

A   6
B   8
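
On the "read 2 inputs" part of the original question: with the old mapred API
several input paths can be added to the same job. A minimal sketch, inside the
example's run() method and reusing the paths from its javadoc:

JobConf conf = new JobConf(getConf(), CountLinesAggregator.class);
// Each call registers one more input file/directory for the same job;
// all files under both paths are fed to the mappers.
FileInputFormat.addInputPath(conf, new Path("/gutenberg-output1"));
FileInputFormat.addInputPath(conf, new Path("/gutenberg-output2"));
FileOutputFormat.setOutputPath(conf, new Path("/final-output"));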


On 5 June 2013 21:08, Pedro Sá da Costa psdc1...@gmail.com wrote:

 I am trying to create a MapReduce example that adds the values of the same keys.
 E.g.,
 the input
 A   1
 A   2
 B   4

 gives the output
 A   3
 B   4

 The problem is that I cannot make the program read 2 inputs. How do I do that?

 Here is my example:

 package org.apache.hadoop.examples;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Iterator;
 import java.util.List;
 import java.util.StringTokenizer;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.FileInputFormat;
 import org.apache.hadoop.mapred.FileOutputFormat;
 import org.apache.hadoop.mapred.JobClient;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;
 import org.apache.hadoop.mapred.Mapper;
 import org.apache.hadoop.mapred.OutputCollector;
 import org.apache.hadoop.mapred.Reducer;
 import org.apache.hadoop.mapred.Reporter;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 /**
  * This is an example Hadoop Map/Reduce application.
  * It takes in several outputs of the count-lines job and sums them together
  * according to the line.
  *
  * To run: bin/hadoop jar build/countlinesaggregator.jar
  *            [-m <i>maps</i>] [-r <i>reduces</i>] <i>in-dirs</i> <i>out-dir</i>
  * e.g.
  *   bin/hadoop jar countlinesaggregator.jar /gutenberg-output1
  * /gutenberg-output2 /final-output
  */
 public class CountLinesAggregator extends Configured implements Tool {
 /**
  * Aggregate keys and values.
  * For each line of input, break the line into words and emit them as
  * (<b>lines</b>, <b>val</b>).
  */
 public static class MapClass extends MapReduceBase
     implements Mapper<LongWritable, Text, Text, IntWritable> {
     private Text word = new Text();

     public void map(LongWritable key, Text value,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
         String line = value.toString();
         StringTokenizer itr = new StringTokenizer(line, "\n");
         while (itr.hasMoreTokens()) {
             String token = itr.nextToken();
             if (token.length() > 0) {
                 System.out.println("Token: " + token);
                 String[] splits = token.split("\t");
                 if (splits[0] != null && splits[1] != null &&
                     splits[0].length() > 0 && splits[1].length() > 0) {
                     System.out.println(Arrays.deepToString(splits));
                     String k = splits[0];
                     String v = splits[1];
                     word.set(k);
                     IntWritable val = new IntWritable(Integer.valueOf(v));
                     output.collect(word, val);
                 }
             }
         }
     }
 }

 /**
  * A reducer class that just emits the sum of the input values.
  */
 public static class Reduce extends MapReduceBase
     implements Reducer<Text, IntWritable, Text, IntWritable> {

     public void reduce(Text key, Iterator<IntWritable> values,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
         int sum = 0;
         while (values.hasNext()) {
             sum += values.next().get();
         }
         output.collect(key, new IntWritable(sum));
     }
 }

 static int printUsage() {
     System.out.println("countlinesaggregator [-m maps] [-r reduces] " +
         "input1 input2 output");
     ToolRunner.printGenericCommandUsage(System.out);
     return -1;
 }

 /**
  * The main driver for the word count map/reduce program.
  * Invoke this method to submit the map/reduce job.
  * @throws IOException When there are communication problems with the
  * job tracker.
  */
 public int run(String[] args) throws Exception {
     JobConf conf = new JobConf(getConf(), CountLinesAggregator.class);
     conf.setJobName("countlinesaggregator");

     // the keys are words (strings)
     conf.setOutputKeyClass(Text.class);
     // the values are counts (ints)
     conf.setOutputValueClass(IntWritable.class);

     conf.setMapperClass(MapClass.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);
     conf.setNumReduceTasks(1);

     List<String> other_args = new ArrayList<String>();
     for (int i = 0; i < args.length; ++i) {
 try {
 if (-m.equals

Print logs in MapReduce example

2013-06-03 Thread Pedro da Costa
Hi,


I created my MapReduce example for Hadoop 2.0.4; how do I print logs to
the console output? System.out.println(), Logger.getRootLogger(),
and Logger.getLogger(MyClass.class) don't print anything.

Here is my code.

public class WordCountAggregator extends Configured implements Tool {

    public static Logger LOG = Logger.getLogger(WordCountAggregator.class);
    public static Logger LOG2 = Logger.getRootLogger();

    /**
     * Counts the words in each line.
     * For each line of input, break the line into words and emit them as
     * (<b>word</b>, <b>1</b>).
     */
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            LOG.setLevel(Level.INFO);
            LOG.addAppender(new ConsoleAppender());
            String line = value.toString();
            System.out.println("LL " + line);
            LOG.debug("LL " + line);
            LOG2.debug("LL " + line);
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                String l = itr.nextToken();
                LOG.info("ABC " + l);
                String[] splits = l.split(" ");
                word.set(splits[0]);
                output.collect(word, new IntWritable(Integer.valueOf(splits[1])));
            }
        }
    }

    /**
     * A reducer class that just emits the sum of the input values.
     */
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}


-- 
Best regards,


set HTTPFS in Hadoop 2.0.4

2013-06-02 Thread Pedro da Costa
Hi,

I set up HttpFS in Hadoop 2.0.4 to run on port 3888. Now I want to
access the filesystem but I can't. Here is the URL that I am using,
and the config files. How can I fix this?

http://host:3888/webhdfs/v1/user/myuser?user.name=myuser&op=list

{"RemoteException":{"message":"java.lang.IllegalArgumentException:


No enum const class
org.apache.hadoop.fs.http.client.HttpFSFileSystem$Operation.LIST",
"exception":"QueryParamException","javaClassName":
"com.sun.jersey.api.ParamException$QueryParamException"}}

$ netstat -plnet
Proto Recv-Q Send-Q Local Address   Foreign Address
State   User   Inode  PID/Program name
tcp        0      0 0.0.0.0:3888        0.0.0.0:*           LISTEN      78250  109481130  1580/java
(the HTTPFS server is running)

$ cat etc/hadoop/httpfs-env.sh
#!/bin/bash
# Set httpfs specific environment variables here.

# Settings for the Embedded Tomcat that runs HttpFS
# Java System properties for HttpFS should be specified in this variable
# export CATALINA_OPTS=

# HttpFS logs directory
# export HTTPFS_LOG=${HTTPFS_HOME}/logs

# HttpFS temporary directory
# export HTTPFS_TEMP=${HTTPFS_HOME}/temp

# The HTTP port used by HttpFS
export HTTPFS_HTTP_PORT=3888

# The Admin port used by HttpFS
# export HTTPFS_ADMIN_PORT=`expr ${HTTPFS_HTTP_PORT} + 1`

# The hostname HttpFS server runs on
export HTTPFS_HTTP_HOSTNAME=`hostname -f`


$ cat etc/hadoop/httpfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>httpfs.proxyuser.myuser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>httpfs.proxyuser.myuser.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>httpfs.authentication.type</name>
    <value>simple</value>
  </property>
</configuration>




-- 
Best regards,


distcp in Hadoop 2.0.4 over http?

2013-06-01 Thread Pedro da Costa
I want to copy HDFS files over HTTP using distcp, but I can't. It is a
configuration problem that I can't find. How can I do distcp in
Hadoop 2.0.4 over HTTP?

First I set up hadoop 2.0.4 over http - Httpfs - on port 3888, which is
running. Here is the proof:

$ curl -i http://zk1.host.com:3888?user.name=babu&op=homedir
[1] 32129
[myuser@zk1 hadoop]$ HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Accept-Ranges: bytes
ETag: W/674-136580299
Last-Modified: Fri, 12 Apr 2013 21:43:10 GMT
Content-Type: text/html
Content-Length: 674
Date: Sat, 01 Jun 2013 15:48:04 GMT

<?xml version="1.0" encoding="UTF-8"?>
<html>
<body>
<b>HttpFs service</b>, service base URL at /webhdfs/v1.
</body>
</html>


But, when I do distcp, I can't copy:
$ hadoop distcp  http://zk1.host:3888/gutenberg/a.txt http://zk1.host:3888/
Warning: $HADOOP_HOME is deprecated.
Copy failed: java.io.IOException: No FileSystem for scheme: http
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

$ hadoop distcp  httpfs://zk1.host:3888/gutenberg/a.txt
httpfs://zk1.host:3888/
Copy failed: java.io.IOException: No FileSystem for scheme: httpfs
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

$ hadoop distcp  hdfs://zk1.host:3888/gutenberg/a.txt hdfs://zk1.host:3888/
Copy failed: java.io.IOException: Call to
zk1.host/127.0.0.1:3888 failed
on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1144)
at org.apache.hadoop.ipc.Client.call(Client.java:1112)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)

Here are my core-site.xml and httpfs-env.sh files, where I configured HDFS and
HttpFS:
$ cat etc/hadoop/core-site.xml
<configuration>
  <property> <name>fs.default.name</name>
    <value>hdfs://zk1.host:9000</value> </property>
  <property> <name>hadoop.proxyuser.myuser.hosts</name> <value>zk1.host</value>
  </property>
  <property> <name>hadoop.proxyuser.myuser.groups</name> <value>*</value>
  </property>
</configuration>

$ cat etc/hadoop/httpfs-env.sh
#!/bin/bash
export HTTPFS_HTTP_PORT=3888
export HTTPFS_HTTP_HOSTNAME=`hostname -f`


-- 
Best regards,


how launch mapred in hadoop 2.0.4?

2013-06-01 Thread Pedro da Costa
In Hadoop MapReduce, is there a need to launch mapred (mapred start), or is
launching YARN ($ sbin/yarn-daemon.sh start resourcemanager ;
sbin/yarn-daemon.sh start nodemanager) the same thing?

-- 
Best regards,


Queues in hadoop 2.0.4

2013-05-30 Thread Pedro da Costa
I am using hadoop 2.0.4

1 - Which component manages queues? Is it the JobTracker?

2 - If so, is it possible to define several queues (set
mapred.job.queue.name=$QUEUE_NAME;)?
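
As a side note, a minimal sketch of submitting a job to a particular queue
(assuming the queue itself has already been defined in the scheduler
configuration; the queue name and driver class are illustrative):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
conf.setQueueName("myqueue");              // equivalent to setting mapred.job.queue.name
JobClient.runJob(conf);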

-- 
Best regards,


copy data between hosts and using hdfs proxy.

2013-05-29 Thread Pedro da Costa
Hi,

I want to copy data between hosts in Hadoop 2.0.4, but the hosts are
  using HDFS Proxy on port 3888. I tried with the protocols hftp,
  httpfs, and hdfs; none of the examples worked:

  hadoop distcp hftp://host1:3888/user/out/part-m-00029 hftp://host2:3888/

  Any suggestion?

-- 
Best regards,


Re: Mapreduce queues

2013-05-27 Thread Pedro da Costa
In this article (
http://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-scheduler-4141.html),
it is stated that "The Scheduler is responsible for allocating
resources to the various running applications subject to familiar
constraints of capacities, queues etc. (...) The Scheduler then allocates
resources based on application-specific constraints such as appropriate
machines and global constraints such as capacities of the application,
queue, user etc."

Maybe this is not the right place to put this question, but I just wanted
to know whether MapReduce uses the term queue. If so, what is a queue in
MapReduce?


On 27 May 2013 09:36, Harsh J ha...@cloudera.com wrote:

 Can you rephrase your question to include definitions of what you mean
 by 'queues' and what you mean by 'clusters'?

 On Wed, May 22, 2013 at 7:22 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  When a cluster has several queues, does the JobTracker have to manage all
  the clusters?
 
  --
  Best regards,



 --
 Harsh J




-- 
Best regards,


HDFS counters

2013-05-24 Thread Pedro da Costa
I am analyzing some HDFS counters, and I have these questions:

1 - Is the "HDFS: Number of bytes read" counter updated as the map tasks read
data from HDFS, or is it a pre-calculated sum before the mappers start to
read?

2 - According to these metrics, some data was written to HDFS before the map
tasks start. Does anyone have an opinion on whether it is possible that the
map tasks write their intermediate output to HDFS? Does this happen because
the job defined by the user forces it (I don't know what this job does)?

<mapcompletion>map() completion: 0.9946828</mapcompletion>
<redcompletion>reduce() completion: 0.0</redcompletion>
<hdfs>HDFS: Number of bytes read=314470180</hdfs>
<hdfs>HDFS: Number of bytes written=313912087</hdfs>

-- 
Best regards,


hadoop queue -list

2013-05-22 Thread Pedro da Costa
1 - I am looking at the queue list in my system, and I have several queues
defined. In one of the queues I have this info:

Scheduling Info : Capacity: 1.0, MaximumCapacity: 1.0, CurrentCapacity:
77.534035

Why is the current capacity much bigger than the maximum capacity?


2 - With the queue info, can I know if there is space to run more jobs?



-- 
Best regards,


Mapreduce queues

2013-05-22 Thread Pedro da Costa
Hi,

When a cluster has several queues, does the JobTracker have to manage all
the clusters?

-- 
Best regards,


job -list parameters

2013-05-21 Thread Pedro da Costa
When I list all the running jobs, I get several parameters related to the
job. What do the parameters marked below mean?


JobId
State
StartTime
UserName
Queue
Priority
UsedContainers   -- what is this parameter?
RsvdContainers   -- what is this parameter?
UsedMem
RsvdMem   -- what is this parameter?
NeededMem
AM info

-- 
Best regards,


Combine data from different HDFS FS

2013-04-08 Thread Pedro da Costa
Hi,

I want to combine data that is in different HDFS filesystems, so that it is
processed in one job. Is it possible to do this with MR, or is there
another Apache tool that allows me to do this?

Eg.

Hdfs data in Cluster1 v
Hdfs data in Cluster2 - this job reads the data from Cluster1, 2


Thanks,
-- 
Best regards,


Re: Combine data from different HDFS FS

2013-04-08 Thread Pedro da Costa
I'm invoking the wordcount example in host1 with this command, but I got an
error.


HOST1:$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount
hdfs://HOST2:54310/gutenberg gutenberg-output

13/04/08 22:02:55 ERROR security.UserGroupInformation:
PriviledgedActionException as:ubuntu
cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
path does not exist: hdfs://HOST2:54310/gutenberg
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
does not exist: hdfs://HOST2:54310/gutenberg

Can you be more specific about using FileInputFormat? I've
configured MapReduce and HDFS to work on each HOST, and I don't know how I can
make a wordcount that reads the HDFS data from files on HOST1 and
HOST2.






On 8 April 2013 19:34, Harsh J ha...@cloudera.com wrote:

 You should be able to add fully qualified HDFS paths from N clusters
 into the same job via FileInputFormat.addInputPath(…) calls. Caveats
 may apply for secure environments, but for non-secure mode this should
 work just fine. Did you try this and did it not work?
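
A minimal sketch of what Harsh describes, assuming the old mapred API and the
host names used in this thread (the driver class is illustrative):

JobConf conf = new JobConf(WordCountDriver.class);   // illustrative driver class
// Fully qualified URIs let a single job read from two different HDFS instances.
FileInputFormat.addInputPath(conf, new Path("hdfs://HOST1:54310/gutenberg"));
FileInputFormat.addInputPath(conf, new Path("hdfs://HOST2:54310/gutenberg"));
FileOutputFormat.setOutputPath(conf, new Path("hdfs://HOST1:54310/gutenberg-output"));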

 On Mon, Apr 8, 2013 at 9:56 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  I want to combine the data that are in different HDFS filesystems, for
 them
  to be executed in one job. Is it possible to do this with MR, or there is
  another Apache tool that allows me to do this?
 
  Eg.
 
  Hdfs data in Cluster1 v
  Hdfs data in Cluster2 - this job reads the data from Cluster1, 2
 
 
  Thanks,
  --
  Best regards,



 --
 Harsh J




-- 
Best regards,


Re: Combine data from different HDFS FS

2013-04-08 Thread Pedro da Costa
Maybe there is some FileInputFormat class that allows defining input files
from different locations. What I would like to know is whether it's possible to
read input data from different HDFS filesystems. E.g., run the wordcount with
input files from the HDFS in HOST1 and HOST2 (the filesystems in HOST1 and
HOST2 are distinct). Any suggestion on which InputFormat I should use?



On 9 April 2013 00:10, Pedro Sá da Costa psdc1...@gmail.com wrote:

 I'm invoking the wordcount example in host1 with this command, but I got
 an error.


 HOST1:$ bin/hadoop jar hadoop-examples-1.0.4.jar wordcount
 hdfs://HOST2:54310/gutenberg gutenberg-output

 13/04/08 22:02:55 ERROR security.UserGroupInformation:
 PriviledgedActionException as:ubuntu
 cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
 path does not exist: hdfs://HOST2:54310/gutenberg
 org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
 does not exist: hdfs://HOST2:54310/gutenberg

 Can you be more specific about using the FileinputFormat? It's because
 I've configured MapReduce and HDFS to work in HOST, and I don't know how
 can I make an wordcount that reads the data from the HDFS from files in
 HOST1 and HOST2?






 On 8 April 2013 19:34, Harsh J ha...@cloudera.com wrote:

 You should be able to add fully qualified HDFS paths from N clusters
 into the same job via FileInputFormat.addInputPath(…) calls. Caveats
 may apply for secure environments, but for non-secure mode this should
 work just fine. Did you try this and did it not work?

 On Mon, Apr 8, 2013 at 9:56 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  I want to combine the data that are in different HDFS filesystems, for
 them
  to be executed in one job. Is it possible to do this with MR, or there
 is
  another Apache tool that allows me to do this?
 
  Eg.
 
  Hdfs data in Cluster1 v
  Hdfs data in Cluster2 - this job reads the data from Cluster1, 2
 
 
  Thanks,
  --
  Best regards,



 --
 Harsh J




 --
 Best regards,




-- 
Best regards,


set the namenode public IP address in amazon EC2?

2013-03-29 Thread Pedro da Costa
Hi,

I'm trying to configure the Namenode with a public IP in Amazon EC2. The
service always gets the host's private IP, and not the public one. How can I
set the namenode's public IP address?

Here are my configuration files:

$ cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property> <name>dfs.replication</name> <value>3</value> </property>
  <property> <name>fs.default.name</name>
    <value>hdfs://ec2-46-XX.eu-west-1.compute.amazonaws.com:54310</value> </property>
  <property> <name>dfs.data.dir</name>
    <value>/home/ubuntu/MRtmp/dfs/data</value> </property>
  <property> <name>dfs.permissions</name><value>false</value></property>
  <property>
    <name>dfs.permissions.enabled</name><value>false</value></property>
  <property><name>dfs.datanode.data.dir.perm</name><value>777</value></property>
</configuration>

$ cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property><name>hadoop.tmp.dir</name><value>/home/ubuntu/MRtmp/dir/hadoop-${user.name}</value></property>
  <property><name>hadoop.backup.files</name><value>true</value></property>
  <property><name>hadoop.tmp.bkp.dir</name><value>/home/ubuntu/MRtmp/backup/dir/hadoop-${user.name}</value></property>
  <property><name>fs.default.name</name><value>hdfs://ec2-46-XXX.eu-west-1.compute.amazonaws.com:54310</value></property>
  <property><name>hadoop.security.authentication</name><value>simple</value></property>
  <property><name>hadoop.security.authorization</name><value>false</value></property>
</configuration>


-- 
Best regards,


Is it possible to set FS permissions (e.g. 755) in hdfs-site.xml?

2013-03-28 Thread Pedro da Costa
Is it possible to set FS permissions (e.g. 755) in hdfs-site.xml?

-- 
Best regards,


FSDataOutputStream can write in a file in a remote host?

2013-03-28 Thread Pedro da Costa
Can FSDataOutputStream write to a file on a remote host?


-- 
Best regards,


FSDataOutputStream hangs in out.close()

2013-03-27 Thread Pedro da Costa
Hi,

I'm using the Hadoop 1.0.4 API to try to submit a job to a remote
JobTracker. I modified the JobClient to submit the same job to
different JTs. E.g., the JobClient is on my PC and it tries to submit the same
job to 2 JTs at different sites in Amazon EC2. When I'm launching the job,
in the setup phase, the JobClient is trying to submit the split file info to
the remote JT. This is the JobClient method where I have the problem:


  public static void createSplitFiles(Path jobSubmitDir,
  Configuration conf, FileSystem   fs,
  org.apache.hadoop.mapred.InputSplit[] splits)
  throws IOException {
FSDataOutputStream out = createFile(fs,
JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf);
SplitMetaInfo[] info = writeOldSplits(splits, out, conf);
out.close();

writeJobSplitMetaInfo(fs,JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir),
new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION),
splitVersion,
info);
  }

1 - The FSDataOutputStream hangs in the out.close() instruction. Why does it
hang? What should I do to solve this?


-- 
Best regards,


Re: FSDataOutputStream hangs in out.close()

2013-03-27 Thread Pedro da Costa
Hi,

I'm trying to make the same client talk to different HDFS and JT
instances that are in different Amazon EC2 sites. The error that I get
is:

 java.io.IOException: Got error for OP_READ_BLOCK,
self=/XXX.XXX.XXX.123:44734,

 remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010,
for file

 
ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316,
for block
   -4664365259588027316_2050

Does this error mean that it wasn't possible to write to the remote host?





On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote:

 You can try to take a jstack stack trace and see what its hung on.
 I've only ever noticed a close() hang when the NN does not accept the
 complete-file call (due to minimum replication not being guaranteed),
 but given your changes (which I haven't an idea about yet) it could be
 something else as well. You're essentially trying to make the same
 client talk to two different FSes I think (aside of the JT RPC).

 On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  I'm using the Hadoop 1.0.4 API to try to submit a job in a remote
  JobTracker. I created modfied the JobClient to submit the same job in
  different JTs. E.g, the JobClient is in my PC and it try to submit the
 same
  Job  in 2 JTs at different sites in Amazon EC2. When I'm launching the
 Job,
  in the setup phase, the JobClient is trying to submit split file info
 into
  the remote JT.  This is the method of the JobClient that I've the
 problem:
 
 
public static void createSplitFiles(Path jobSubmitDir,
Configuration conf, FileSystem   fs,
org.apache.hadoop.mapred.InputSplit[] splits)
throws IOException {
  FSDataOutputStream out = createFile(fs,
  JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf);
  SplitMetaInfo[] info = writeOldSplits(splits, out, conf);
  out.close();
 
 
 writeJobSplitMetaInfo(fs,JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir),
  new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION),
  splitVersion,
  info);
}
 
  1 - The FSDataOutputStream hangs in the out.close() instruction. Why it
  hangs? What should I do to solve this?
 
 
  --
  Best regards,



 --
 Harsh J




-- 
Best regards,


Re: FSDataOutputStream hangs in out.close()

2013-03-27 Thread Pedro da Costa
I can add this information taken from the datanode logs, but it seems
something related to blocks:

nfoPort=50075, ipcPort=50020):Got exception while serving
blk_-4664365259588027316_2050 to /XXX.XXX.XXX.123:
java.io.IOException: Block blk_-4664365259588027316_2050 is not valid.
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072)
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035)
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045)
at
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
at java.lang.Thread.run(Thread.java:662)

2013-03-27 15:44:54,965 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(XXX.XXX.XXX.123:50010,
storageID=DS-595468034-XXX.XXX.XXX.123-50010-1364122596021, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: Block blk_-4664365259588027316_2050 is not valid.
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072)
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035)
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045)
at
org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
at java.lang.Thread.run(Thread.java:662)

I still have no idea why this error occurs, given that the 2 HDFS instances
have the same data.


On 27 March 2013 15:53, Pedro Sá da Costa psdc1...@gmail.com wrote:

 Hi,

 I'm trying to make the same client to talk to different HDFS and JT
 instances that are in different sites of Amazon EC2. The error that I got
 is:

  java.io.IOException: Got error for OP_READ_BLOCK,
 self=/XXX.XXX.XXX.123:44734,

  remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010,
 for file

  
 ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316,
 for block
-4664365259588027316_2050

 This error means than it wasn't possible to write on a remote host?





 On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote:

 You can try to take a jstack stack trace and see what its hung on.
 I've only ever noticed a close() hang when the NN does not accept the
 complete-file call (due to minimum replication not being guaranteed),
 but given your changes (which I haven't an idea about yet) it could be
 something else as well. You're essentially trying to make the same
 client talk to two different FSes I think (aside of the JT RPC).

 On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  I'm using the Hadoop 1.0.4 API to try to submit a job in a remote
  JobTracker. I created modfied the JobClient to submit the same job in
  different JTs. E.g, the JobClient is in my PC and it try to submit the
 same
  Job  in 2 JTs at different sites in Amazon EC2. When I'm launching the
 Job,
  in the setup phase, the JobClient is trying to submit split file info
 into
  the remote JT.  This is the method of the JobClient that I've the
 problem:
 
 
public static void createSplitFiles(Path jobSubmitDir,
Configuration conf, FileSystem   fs,
org.apache.hadoop.mapred.InputSplit[] splits)
throws IOException {
  FSDataOutputStream out = createFile(fs,
  JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf);
  SplitMetaInfo[] info = writeOldSplits(splits, out, conf);
  out.close();
 
 
 writeJobSplitMetaInfo(fs,JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir),
  new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION),
  splitVersion,
  info);
}
 
  1 - The FSDataOutputStream hangs in the out.close() instruction. Why it
  hangs? What should I do to solve this?
 
 
  --
  Best regards,



 --
 Harsh J




 --
 Best regards,




-- 
Best regards,


Re: FSDataOutputStream hangs in out.close()

2013-03-27 Thread Pedro da Costa
I just created 2 different FS instances.

On Wednesday, 27 March 2013, Harsh J wrote:

 Same data does not mean same block IDs across two clusters. I'm
 guessing this is cause of some issue in your code when wanting to
 write to two different HDFS instances with the same client. Did you do
 a low level mod for HDFS writes as well or just create two different
 FS instances when you want to write to different ones?

 On Wed, Mar 27, 2013 at 9:34 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  I can add this information taken from the datanode logs, but it seems
  something related to blocks:
 
  nfoPort=50075, ipcPort=50020):Got exception while serving
  blk_-4664365259588027316_2050 to /XXX.XXX.XXX.123:
  java.io.IOException: Block blk_-4664365259588027316_2050 is not valid.
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072)
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035)
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045)
  at
 
  org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
  at
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
  at
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
  at java.lang.Thread.run(Thread.java:662)
 
  2013-03-27 15:44:54,965 ERROR
  org.apache.hadoop.hdfs.server.datanode.DataNode:
  DatanodeRegistration(XXX.XXX.XXX.123:50010,
  storageID=DS-595468034-XXX.XXX.XXX.123-50010-1364122596021,
 infoPort=50075,
  ipcPort=50020):DataXceiver
  java.io.IOException: Block blk_-4664365259588027316_2050 is not valid.
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:1072)
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:1035)
  at
 
 org.apache.hadoop.hdfs.server.datanode.FSDataset.getVisibleLength(FSDataset.java:1045)
  at
 
  org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:94)
  at
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
  at
 
 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
  at java.lang.Thread.run(Thread.java:662)
 
  I still have no idea why this error, if the 2 HDFS instances have the
 same
  data.
 
 
  On 27 March 2013 15:53, Pedro Sá da Costa psdc1...@gmail.com wrote:
 
  Hi,
 
  I'm trying to make the same client to talk to different HDFS and JT
  instances that are in different sites of Amazon EC2. The error that I
 got
  is:
 
   java.io.IOException: Got error for OP_READ_BLOCK,
  self=/XXX.XXX.XXX.123:44734,
 
 
 remote=ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010,
  for file
 
 
 ip-XXX-XXX-XXX-123.eu-west-1.compute.internal/XXX.XXX.XXX.123:50010:-4664365259588027316,
  for block
 -4664365259588027316_2050
 
  This error means than it wasn't possible to write on a remote host?
 
 
 
 
 
  On 27 March 2013 12:24, Harsh J ha...@cloudera.com wrote:
 
  You can try to take a jstack stack trace and see what its hung on.
  I've only ever noticed a close() hang when the NN does not accept the
  complete-file call (due to minimum replication not being guaranteed),
  but given your changes (which I haven't an idea about yet) it could be
  something else as well. You're essentially trying to make the same
  client talk to two different FSes I think (aside of the JT RPC).
 
  On Wed, Mar 27, 2013 at 5:50 PM, Pedro Sá da Costa psdc1...@gmail.com
 
  wrote:
   Hi,
  
 --
 Harsh J



-- 
Best regards,


Re: is it possible to disable security in MapReduce to avoid having PriviledgedActionException?

2013-03-25 Thread Pedro da Costa
This is my error (stack trace below). It cannot
find the org.apache.hadoop.security.KerberosName class. But the strange thing
is that I have hadoop-core-1.0.4-SNAPSHOT.jar in the classpath, and the path to
the jar is correct. I've no idea what the problem is. Any help?

java.io.IOException: failure to login
at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:490)
 at
org.apache.hadoop.mapred.manager.DeferredScheduler$1.run(DeferredScheduler.java:80)
Caused by: javax.security.auth.login.LoginException:
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.security.KerberosName
 at org.apache.hadoop.security.User.<init>(User.java:44)
at org.apache.hadoop.security.User.<init>(User.java:39)
 at
org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule.commit(UserGroupInformation.java:130)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:769)
 at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
 at java.security.AccessController.doPrivileged(Native Method)
at
javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
 at javax.security.auth.login.LoginContext.login(LoginContext.java:576)
at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:471)


On 25 March 2013 02:11, Harsh J ha...@cloudera.com wrote:

 What is the exact error you're getting? Can you please paste with
 the full stack trace and your version in use?

 Many times the PriviledgedActionException is just a wrapper around the
 real cause and gets overlooked. It does not necessarily appear due to
 security code (whether security is enabled or disabled).

 In any case, if you meant to run MR with zero UGI.doAs (which will
 wrap with that exception) then no, thats not possible to do.

 On Mon, Mar 25, 2013 at 12:57 AM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  Hi,
 
  is it possible to disable security in MapReduce to avoid having
  PriviledgedActionException?
 
  Thanks,
 



 --
 Harsh J




-- 
Best regards,


who runs the map and reduce tasks in the unit tests

2013-02-21 Thread Pedro da Costa
Hi,

In the Hadoop MR unit tests, the classes use
./core/org/apache/hadoop/util/Tool.java and
./core/org/apache/hadoop/util/ToolRunner.java to submit the job. But to
run the unit tests it seems that MR does not need to be running. If
so, who runs the map and reduce tasks?


-- 
Best regards,
P


configure mapreduce to work with pem files.

2013-02-13 Thread Pedro da Costa
I'm trying to configure ssh for Hadoop MapReduce, but my nodes only
communicate with each other using RSA keys in pem format.

(It doesn't work)
ssh user@host
Permission denied (publickey).

(It works)
ssh -i ~/key.pem user@host

The nodes in MapReduce communicate using ssh. How do I configure ssh, or
MapReduce, to work with the pem file?


-- 
Best regards,
P


Re: configure mapreduce to work with pem files.

2013-02-13 Thread Pedro da Costa
So why is it necessary to configure ssh in Hadoop MR?

On 13 February 2013 12:58, Harsh J ha...@cloudera.com wrote:

 Hi,

 Nodes in Hadoop do not communicate using SSH. See
 http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

 On Wed, Feb 13, 2013 at 5:16 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  I'm trying to configure ssh for the Hadoop mapreduce, but my nodes only
  communicate with each others using RSA keys in pem format.
 
  (It doesn't work)
  ssh user@host
  Permission denied (publickey).
 
  (It works)
  ssh -i ~/key.pem user@host
 
  The nodes in mapreduce communicate using ssh. How I configure the ssh, or
  the mapreduce to work with the pem file.
 
 
  --
  Best regards,
  P



 --
 Harsh J




-- 
Best regards,


Re: Save configuration data in job configuration file.

2013-01-20 Thread Pedro da Costa
This does not save it in the XML file. I think this just keeps the
variable in memory.

On 19 January 2013 18:48, Arun C Murthy a...@hortonworks.com wrote:
 jobConf.set(String, String)?
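
If the goal is to end up with the extra properties in an XML file on disk
(rather than only in the in-memory JobConf), one option is
Configuration.writeXml. A minimal sketch, with an illustrative property name
and output path:

import java.io.FileOutputStream;
import org.apache.hadoop.mapred.JobConf;

public class SaveJobConf {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("my.custom.key", "my-value");          // illustrative property
        FileOutputStream out = new FileOutputStream("my-job-conf.xml");
        conf.writeXml(out);                             // serializes every property to XML
        out.close();
    }
}

Note that anything set with jobConf.set(...) before the job is submitted also
ends up in the job.xml that the submission code writes for the job; a separate
file like the one above is only needed if you want your own copy.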




-- 
Best regards,


Save configuration data in job configuration file.

2013-01-19 Thread Pedro da Costa
Hi

I want to save some configuration data in the configuration files that
belong to the job. How can I do it?

-- 
Best regards,


Re: When reduce tasks start in MapReduce Streaming?

2013-01-16 Thread Pedro da Costa
So why is it called Hadoop streaming, if it doesn't behave like a
streaming application (the reduces don't receive data as it is
produced by the map tasks)?


On 16 January 2013 05:41, Jeff Bean jwfb...@cloudera.com wrote:
 me property. The reduce method is not called until the mappers are done, and
 the reducers are not scheduled before the threshold set by
 mapred.reduce.slowstart.completed.maps is reached.
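
For reference, a minimal sketch of tuning that threshold on a job (assuming the
old mapred API; the value is the fraction of completed maps, commonly 0.05 by
default):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
// Don't schedule reducers until 80% of the maps have completed; reduce() itself
// still only runs after all of the map output has been fetched and merged.
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);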




-- 
Best regards,


When reduce tasks start in MapReduce Streaming?

2013-01-15 Thread Pedro da Costa
Hi,

I read in some documents that in MapReduce, the reduce tasks only start
after a percentage (by default 90%) of the maps end. This means that the
slowest maps can delay the start of the reduce tasks, and the input data
that is consumed by the reduce tasks is treated as a batch of
data. This means that the scenario of having reduce tasks consuming
data as the map tasks produce it doesn't exist. But does this still
happen with Hadoop MapReduce streaming?

-- 
Best regards,
P


Map tasks allocation in reduce slots?

2012-12-29 Thread Pedro da Costa
The MapReduce framework has map and reduce slots, which are used to track which
tasks are running. When only map tasks are running, will the reduce slots that
the job has be filled by map tasks?

-- 
Best regards,


Profiler in Hadoop MapReduce

2012-12-15 Thread Pedro da Costa
Hi

I want to attach JProfiler to Hadoop MapReduce (MR). Do I need to
configure MR to open ports for the JobTracker, TaskTracker, and map
and reduce tasks so that I can attach JProfiler?
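
A minimal sketch of turning on per-task profiling through the job configuration
(assuming the Hadoop 1.x JobConf API; the agent path and options are
illustrative and would need to match the local JProfiler install):

JobConf conf = new JobConf(MyJob.class);          // MyJob is a placeholder driver class
conf.setProfileEnabled(true);                     // sets mapred.task.profile
conf.setProfileTaskRange(true, "0-1");            // profile only the first two map tasks
conf.setProfileTaskRange(false, "0");             // and the first reduce task
// Replace the default hprof options with the profiler agent of your choice.
conf.setProfileParams("-agentpath:/path/to/jprofiler/libjprofilerti.so=port=8849");

Profiling the JobTracker and TaskTracker daemons themselves is a separate
matter of adding agent flags to their JVM options (e.g. via HADOOP_OPTS), which
is not shown here.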


-- 
Best regards,


Map output files and partitions.

2012-12-13 Thread Pedro da Costa
Hi,

There are only 2 types of map output files, Sequence and Text files. If
those files are going to be used as input to several reduce tasks,
they need to be partitioned into blocks. Are there any SEPARATOR bits
that delimit each partition? Can I read a specific partition of a map
output file? Is there an API for that?

-- 
Best regards,


Get job, map and reduce times with RunningJob API.

2012-11-29 Thread Pedro da Costa
Hi,

I want to know when the job and its map and reduce tasks started and ended,
using the RunningJob API. How can I get this information?

Thanks,
-- 
Best regards,


Re: Get job, map and reduce times with RunningJob API.

2012-11-29 Thread Pedro da Costa
For that I think I must access the JobTracker to get the TaskReports.
But how can I access the JobTracker server from a Java class?

In a JSP you just need the instruction final JobTracker tracker =
(JobTracker) application.getAttribute("job.tracker");, but what do I
need in plain Java?
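
A minimal sketch of pulling per-task start/finish times with the client-side
APIs (assuming Hadoop 1.x and that mapred.job.tracker is set in the client
configuration; the job ID string is illustrative):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskReport;

public class JobTimes {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();                       // picks up mapred.job.tracker
        JobClient client = new JobClient(conf);
        JobID id = JobID.forName("job_201211290001_0001");  // illustrative job ID
        RunningJob job = client.getJob(id);
        System.out.println("job state: " + job.getJobState());
        for (TaskReport r : client.getMapTaskReports(id)) {
            System.out.println("map " + r.getTaskID() + " took "
                + (r.getFinishTime() - r.getStartTime()) + " ms");
        }
        for (TaskReport r : client.getReduceTaskReports(id)) {
            System.out.println("reduce " + r.getTaskID() + " took "
                + (r.getFinishTime() - r.getStartTime()) + " ms");
        }
    }
}

RunningJob itself mostly exposes progress and state; the per-task timing lives
in the TaskReport objects.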


On 29 November 2012 11:12, Pedro Sá da Costa psdc1...@gmail.com wrote:
 Hi,

 I want to know when the job and map and reduce tasks started and ended
 in a job using the RunningJob API. How can I get this information?

 Thanks,
 --
 Best regards,



-- 
Best regards,


Job progress in bash

2012-11-28 Thread Pedro da Costa
Hadoop MapReduce has a web interface that shows the progress of
running jobs. Can I get the same information about job progress in
bash? Is there a program to print the progress in the terminal?

Thanks,

-- 
Best regards,


Re: Job progress in bash

2012-11-28 Thread Pedro da Costa
Yes I can, but I want more details about the tasks, like the times they
started and ended, and the duration of the shuffle. I want as much
information as the hadoop job -history all command can give, but while
the job is still running.



On 28 November 2012 11:32, Harsh J ha...@cloudera.com wrote:
 hadoop job -status



-- 
Best regards,


Get JobInProgress given jobId

2012-11-28 Thread Pedro da Costa
I'm building a Java class and, given a JobID, how can I get the
JobInProgress? Can anyone give me an example?

-- 
Best regards,


Re: Get JobInProgress given jobId

2012-11-28 Thread Pedro da Costa
I have the jobId as a String, and from that I want to access the
RunningJob API for that jobId. I think that it is only possible to
access this API through the JobInProgress class, but maybe I'm wrong.
Is this true?


On 28 November 2012 17:24, Mahesh Balija balijamahesh@gmail.com wrote:
 Hi Pedro,

   You can get the JobInProgress instance from JobTracker.
  JobInProgress getJob(JobID jobid);

 Best,
 Mahesh Balija,
 Calsoft Labs.

 On Wed, Nov 28, 2012 at 10:41 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:

 I'm building a Java class and given a JobID, how can I get the
 JobInProgress? Can anyone give me an example?

 --
 Best regards,





-- 
Best regards,


Re: Get JobInProgress given jobId

2012-11-28 Thread Pedro da Costa
On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote:
 nt application's hadoop jar same version as the server?

Yes it is.

 2. Is the port 54311 the proper JobTracker port?

This jobtracker port is set to:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <!-- <value>local</value> -->
  <description>The host and port that the MapReduce job tracker runs
at.  If "local", then jobs are run in-process as a single map and
reduce task.
  </description>
</property>




-- 
Best regards,


Re: Get JobInProgress given jobId

2012-11-28 Thread Pedro da Costa
I have this error in the JobTracker log. Maybe this is the reason. What does
this error mean?

2012-11-28 19:19:17,697 WARN org.apache.hadoop.ipc.Server: Incorrect
header or version mismatch from 127.0.0.1:60089 got version 4 expected
version 3



On 28 November 2012 18:28, Pedro Sá da Costa psdc1...@gmail.com wrote:
 On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote:
 nt application's hadoop jar same version as the server?

 Yes it is.

 2. Is the port 54311 the proper JobTracker port?

 This jobtracker port is set to:

 <property>
   <name>mapred.job.tracker</name>
   <value>localhost:54311</value>
   <!--<value>local</value>-->
   <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map and
 reduce task.
   </description>
 </property>




 --
 Best regards,



-- 
Best regards,


Re: Get JobInProgress given jobId

2012-11-28 Thread Pedro da Costa
As for this error: maybe, after all, I'm running hadoop jars of different
versions. I'm running hadoop-0.20 and trying to run the JobClient with
hadoop-1.0.

On 28 November 2012 19:20, Pedro Sá da Costa psdc1...@gmail.com wrote:
 I have this error in the jobtracker log. Maybe this is the reason. What
 does this error mean?

 2012-11-28 19:19:17,697 WARN org.apache.hadoop.ipc.Server: Incorrect
 header or version mismatch from 127.0.0.1:60089 got version 4 expected
 version 3



 On 28 November 2012 18:28, Pedro Sá da Costa psdc1...@gmail.com wrote:
 On 28 November 2012 18:12, Harsh J ha...@cloudera.com wrote:
 1. Is the client application's hadoop jar the same version as the server?

 Yes it is.

 2. Is the port 54311 the proper JobTracker port?

 This jobtracker port is set to:

 <property>
   <name>mapred.job.tracker</name>
   <value>localhost:54311</value>
   <!--<value>local</value>-->
   <description>The host and port that the MapReduce job tracker runs
 at.  If "local", then jobs are run in-process as a single map and
 reduce task.
   </description>
 </property>




 --
 Best regards,



 --
 Best regards,



-- 
Best regards,


Should I use MapReduce 0.2X or 1.0?

2012-10-31 Thread Pedro da Costa
I've noticed that Hadoop MapReduce 1.0.4 was released on 12 October 2012,
and Hadoop 0.23.4 on 15 October 2012. I thought that with Hadoop 1.0 the
0.2X line had been discontinued. If I want to start using Hadoop MapReduce,
which version should I use?


What's the difference between Hadoop MapReduce 0.2X and Hadoop MapReduce
1.0?


-- 
Best regards,


Cannot run program autoreconf

2012-09-25 Thread Pedro da Costa
I'm trying to compile MapReduce, but I get this error:

create-native-configure:

BUILD FAILED
/home/xeon/Projects/hadoop-1.0.3/build.xml:618: Execute failed:
java.io.IOException: Cannot run program autoreconf (in directory
/home/xeon/Projects/hadoop-1.0.3/src/native): java.io.IOException:
error=2, No such file or directory

What does this error mean?


-- 
Best regards,


SecretKey in MapReduce

2012-09-20 Thread Pedro da Costa
Hi,

- Hadoop 1.0.2 uses a SecretKey, but I don't understand its purpose. Can
anyone explain what the SecretKey is for?

- Is this secret key shared between the JobTracker, the TaskTrackers, and
the map and reduce tasks?


-- 
Best regards,


splits and maps

2012-09-19 Thread Pedro da Costa
If I have an input file of 640 MB and a split size of 64 MB, the file
will be partitioned into 10 splits, and each split will be processed by
a map task, right?
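
That is the usual outcome for a splittable file. A small sketch of pinning
the split size explicitly with the new-API FileInputFormat (the path and the
sizes are made up; normally the split size simply follows the HDFS block
size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-size-demo");
    FileInputFormat.addInputPath(job, new Path("/wiki/part-00000")); // hypothetical 640 MB file
    // With 64 MB splits, a 640 MB splittable file yields 640 / 64 = 10 splits,
    // and the framework schedules one map task per split.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    System.out.println("configured " + job.getJobName());
  }
}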

-- 
Best regards,


How do map tasks know which is the input file?

2012-08-15 Thread Pedro da Costa
Hi,

1 - In the JobTracker of Hadoop MapReduce 1.0.3 there's a new JobToken.
What's the purpose of the JobToken?

2 - I also notice that the input now comes with some metafiles. The way map
tasks get their input seems completely different from what hadoop 0.20.0
does: the input file name isn't given directly to the map tasks anymore. Can
someone give me an insight into how a map task knows the file name and path
of its input split?
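
On the second point, a sketch of how a map task can see its own slice of the
input: the framework hands every map task an InputSplit, and for file-based
input formats that split carries the path, offset and length (new mapreduce
API assumed):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    // Path, offset and length of the slice this particular map task reads.
    System.out.println("reading " + split.getPath()
        + " offset=" + split.getStart() + " length=" + split.getLength());
  }
}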

-- 
Best regards,


submit a job in a remote jobtracker

2012-08-14 Thread Pedro da Costa
I want to submit a job to a remote jobtracker. How can I do it?

-- 
Best regards,


Re: submit a job in a remote jobtracker

2012-08-14 Thread Pedro da Costa
But this solution implies that a user must log in to the remote machine
before submitting the job. That is not what I want. I want to submit the
job from my local machine and have it forwarded to the remote JobTracker.
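
One way that usually works (a sketch only; it assumes the client and the
cluster run the same Hadoop version, and the host name and ports below are
placeholders) is to point the client-side configuration at the remote
NameNode and JobTracker, so JobClient submits over RPC and nothing has to be
run on the remote machine:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RemoteSubmit.class);
    conf.set("fs.default.name", "hdfs://remotehost:9000");   // remote NameNode (placeholder)
    conf.set("mapred.job.tracker", "remotehost:54311");      // remote JobTracker (placeholder)
    conf.setJobName("remote-submit-demo");
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path("/wiki"));
    FileOutputFormat.setOutputPath(conf, new Path("/wiki-out"));
    JobClient.runJob(conf);   // talks to the JobTracker above over RPC, no ssh involved
  }
}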


On 14 August 2012 14:15, Harsh J ha...@cloudera.com wrote:

 Hi Pedro,

 This has been asked before. See
 http://search-hadoop.com/m/bikPd1LWhhB1 (or search more on that same
 site)

 On Tue, Aug 14, 2012 at 6:32 PM, Pedro Sá da Costa psdc1...@gmail.com
 wrote:
  I want to submit a job to a remote jobtracker. How can I do it?
 
  --
  Best regards,
 



 --
 Harsh J




-- 
Best regards,


Can't run Hadoop MR 1.0.3 Junit tests in IDE.

2012-08-10 Thread Pedro da Costa
Hi,

I'm trying to run the Hadoop JUnit tests in an IDE, but I get errors.
MapReduce itself runs properly for me. I'm using Hadoop 1.0.3:
2012-08-10 13:53:05,803 ERROR mapred.MiniMRCluster
(MiniMRCluster.java:run(119)) - Job tracker crashed
java.lang.NullPointerException
at java.io.File.<init>(File.java:239)
 at org.apache.hadoop.mapred.JobHistory.initLogDir(JobHistory.java:531)
at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:499)
 at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2330)
at org.apache.hadoop.mapred.JobTracker$2.run(JobTracker.java:2327)
 at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2327)
 at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2188)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:2182)
 at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:296)
at
org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner$1.run(MiniMRCluster.java:114)
 at
org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner$1.run(MiniMRCluster.java:1)
at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:416)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 at
org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:112)
at java.lang.Thread.run(Thread.java:679)
2012-08-10 13:53:05,895 INFO  mapred.MiniMRCluster
(MiniMRCluster.java:<init>(188)) - mapred.local.dir is
/home/xeon/workspace/hadoop-1.0.3-tests/build/test/mapred/local/0_0
2012-08-10 13:53:10,923 INFO  http.HttpServer
(HttpServer.java:addGlobalFilter(411)) - Added global filtersafety
(class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2012-08-10 13:53:10,941 INFO  mapred.TaskLogsTruncater
(TaskLogsTruncater.java:<init>(72)) - Initializing logs' truncater with
mapRetainSize=-1 and reduceRetainSize=-1
2012-08-10 13:53:10,947 INFO  mapred.TaskTracker
(TaskTracker.java:initialize(694)) - Starting tasktracker with owner as xeon
2012-08-10 13:53:10,948 INFO  mapred.TaskTracker
(TaskTracker.java:initialize(710)) - Good mapred local directories are:
/home/xeon/workspace/hadoop-1.0.3-tests/build/test/mapred/local/0_0
2012-08-10 13:53:10,959 WARN  util.NativeCodeLoader
(NativeCodeLoader.java:<clinit>(52)) - Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
2012-08-10 13:53:10,977 INFO  ipc.Server (Server.java:run(328)) - Starting
SocketReader
2012-08-10 13:53:10,979 INFO  ipc.Server (Server.java:run(598)) - IPC
Server Responder: starting
2012-08-10 13:53:10,979 INFO  ipc.Server (Server.java:run(434)) - IPC
Server listener on 58393: starting
2012-08-10 13:53:10,983 INFO  ipc.Server (Server.java:run(1358)) - IPC
Server handler 0 on 58393: starting
2012-08-10 13:53:10,984 INFO  ipc.Server (Server.java:run(1358)) - IPC
Server handler 1 on 58393: starting
2012-08-10 13:53:10,984 INFO  ipc.Server (Server.java:run(1358)) - IPC
Server handler 2 on 58393: starting
2012-08-10 13:53:10,987 INFO  ipc.Server (Server.java:run(1358)) - IPC
Server handler 3 on 58393: starting
2012-08-10 13:53:10,987 INFO  mapred.TaskTracker
(TaskTracker.java:initialize(794)) - TaskTracker up at:
localhost.localdomain/127.0.0.1:58393
2012-08-10 13:53:10,988 INFO  mapred.TaskTracker
(TaskTracker.java:initialize(797)) - Starting tracker tracker_host0.foo.com:
localhost.localdomain/127.0.0.1:58393
2012-08-10 13:53:12,050 INFO  ipc.Client
(Client.java:handleConnectionFailure(666)) - Retrying connect to server:
localhost/127.0.0.1:0. Already tried 0 time(s).
2012-08-10 13:53:13,051 INFO  ipc.Client
(Client.java:handleConnectionFailure(666)) - Retrying connect to server:
localhost/127.0.0.1:0. Already tried 1 time(s).
2012-08-10 13:53:14,053 INFO  ipc.Client
(Client.java:handleConnectionFailure(666)) - Retrying connect to server:
localhost/127.0.0.1:0. Already tried 2 time(s).


How do I run the Hadoop JUnit tests properly?
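
A guess from the stack trace: JobHistory.initLogDir builds a File from the
hadoop.log.dir system property, which the ant build sets for the tests but
an IDE usually does not, hence the NullPointerException. A minimal sketch of
a test that sets the property by hand before starting MiniMRCluster (the
directory is arbitrary):

import junit.framework.TestCase;

import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniMRSmokeTest extends TestCase {
  public void testClusterStarts() throws Exception {
    // Guess: without this property the JobTracker inside MiniMRCluster hits
    // the NPE shown above in JobHistory.initLogDir.
    System.setProperty("hadoop.log.dir", "/tmp/hadoop-test-logs");
    MiniMRCluster mr = new MiniMRCluster(1, "file:///", 1);
    try {
      assertNotNull(mr.createJobConf());
    } finally {
      mr.shutdown();
    }
  }
}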


Thanks,

-- 
Best regards,