hadoop input buffer size

2011-10-04 Thread Mark question
Hello,

  Correct me if I'm wrong, but when a program opens n files at the same time
to read from and starts reading from each file one line at a time, isn't
Hadoop actually fetching dfs.block.size worth of lines into a buffer, and not
just one line?

  If this is correct: I set my dfs.block.size = 3MB and each line takes only
about 650 bytes, so I would assume the performance for reading 1-4000 lines
would be the same, but it isn't! Do you know a way to find the number of
lines that are read at once?

Thank you,
Mark


Re: Monitoring Slow job.

2011-10-04 Thread patrick sang
Hi all,
Thanks, Vitthal; your reply led me to the solution.
Here is what I did; hopefully it counts as a contribution back to the
community.

1. ./hadoop job -list all |awk '{ if($2==1) print $1 }'
---> to get the list of running JobID

2. ./hadoop job -status JobID
--> file: hdfs://xxx.xx.xxx/zz/aa/job.xml

3. ./hadoop fs -cat hdfs://xxx.xx.xxx/zz/aa/job.xml
here we got mapred.output.dir

4. ./hadoop job -history 
--> Launched At:

5. Get the time diff between "launched at" and now.

It would be much easier to just scrape
http://jobtracker:50030/jobdetails.jsp?jobid=jobId
or there might be some more elegant way of getting this; it's just how I did
it this time.
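
If someone wants the same thing from the Java side instead of scraping, here
is a rough sketch using the old org.apache.hadoop.mapred client API
(JobClient/JobStatus as in 0.20.x; I have not checked other versions, so
treat it only as a starting point):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class RunningJobAges {
    public static void main(String[] args) throws Exception {
        // new JobConf() picks up mapred.job.tracker etc. from the config on the classpath
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.getAllJobs()) {
            if (status.getRunState() == JobStatus.RUNNING) {
                long ageSec = (System.currentTimeMillis() - status.getStartTime()) / 1000;
                System.out.println(status.getJobID() + " has been running for " + ageSec + " sec");
            }
        }
    }
}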

Cheers,
P

On Mon, Oct 3, 2011 at 5:41 PM, Vitthal "Suhas" Gogate <
gog...@hortonworks.com> wrote:

> I am not sure there is an easy way to get what you want on the command line.
> One option is to use the following command, which gives you a verbose job
> history where you can find the submit, launch & finish times (including the
> duration on the Finished At line). I am using the hadoop-0.20.205.0 branch,
> so check whether some such option exists for the version of Hadoop you are
> using...
>
> I am pasting sample output for my wordcount program,
>
> bin/hadoop job -history 
>
> ==
>
> horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
> Warning: $HADOOP_HOME is deprecated.
>
>
> Hadoop job: 0001_1317688277686_vgogate
> =
> Job tracker host name: job
> job tracker start time: Sun May 16 08:53:51 PDT 1976
> User: vgogate
> JobName: word count
> JobConf:
>
> hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
> Submitted At: 3-Oct-2011 17:31:17
> Launched At: 3-Oct-2011 17:31:17 (0sec)
> Finished At: 3-Oct-2011 17:31:50 (32sec)
> Status: SUCCESS
> Counters:
>
> |Group Name                  |Counter name                                                        |Map Value |Reduce Value |Total Value |
> -----------------------------------------------------------------------------------------------------------------------------------------
> |Job Counters                |Launched reduce tasks                                               |0         |0            |1           |
> |Job Counters                |SLOTS_MILLIS_MAPS                                                   |0         |0            |12,257      |
> |Job Counters                |Total time spent by all reduces waiting after reserving slots (ms) |0         |0            |0           |
> |Job Counters                |Total time spent by all maps waiting after reserving slots (ms)    |0         |0            |0           |
> |Job Counters                |Launched map tasks                                                  |0         |0            |1           |
> |Job Counters                |Data-local map tasks                                                |0         |0            |1           |
> |Job Counters                |SLOTS_MILLIS_REDUCES                                                |0         |0            |10,082      |
> |File Output Format Counters |Bytes Written                                                       |0         |61,192       |61,192      |
> |FileSystemCounters          |FILE_BYTES_READ                                                     |0         |70,766       |70,766      |
> |FileSystemCounters          |HDFS_BYTES_READ                                                     |112,056   |0            |112,056     |
> |FileSystemCounters          |FILE_BYTES_WRITTEN                                                  |92,325    |92,294       |184,619     |
> |FileSystemCounters          |HDFS_BYTES_WRITTEN                                                  |0         |61,192       |61,192      |
> |File Input Format Counters  |Bytes Read                                                          |111,933   |0            |111,933     |
> |Map-Reduce Framework        |Reduce input groups                                                 |0         |2,411        |2,411       |
> |Map-Reduce Framework        |Map output materialized bytes                                       |70,766    |0            |70,766      |
> |Map-Reduce Framework        |Combine output records                                              |2,411     |0            |2,411       |
> |Map-Reduce Framework        |Map input records                                                   |2,643     |0            |2,643       |
> |Map-Reduce Framework        |Reduce shuffle bytes                                                |0         |0            |0           |
> |Map-Reduce Framework        |Reduce output records                                               |0         |2,411        |2,411       |
> |Map-Reduce Framework        |Spilled Records                                                     |2,411     |2,411        |4,822       |
> |Map-Reduce Framework        |Map output bytes                                                    |120,995   |0            |120,995     |
> |Map-Reduce Framework        |Combine input records                                               |5,849     |0            |5,849       |
> |Map-Reduce Framework        |Map output records                                                  |5,849     |0            |5,849       |
> |Map-Reduce Framework        |SPLIT_RAW_BYTES                                                     |123       |0            |123         |
> |Map-Reduce Framework        |Reduce input records                                                |0         |2,411        |2,411       |
> =
>
> Task Summary
> 
> Kind     Total  Successful  Failed  Killed  StartTime             FinishTime
>
> Setup    1      1           0       0       3-Oct-2011 17:31:20   3-Oct-2011 17:31:24 (4sec)
> Map      1      1           0       0       3-Oct-2011 17:31:26   3-Oct-2011 17:31:30 (4sec)
> Reduce   1      1           0       0       3-Oct-2011 17:31:32   3-Oct-2011 17:31:42 (10sec)
> Cleanup  1      1           0       0       3-Oct-2011 17:31:44   3-Oct-2011 17:31:48 (4sec)
> 
>
>
> Analysis
> =
>
> Time taken by best performing map task task_201110031726_0001_

Re: ways to expand hadoop.tmp.dir capacity?

2011-10-04 Thread Meng Mao
I just read this:

MapReduce performance can also be improved by distributing the temporary
data generated by MapReduce tasks across multiple disks on each machine:

  <property>
    <name>mapred.local.dir</name>
    <value>/d1/mapred/local,/d2/mapred/local,/d3/mapred/local,/d4/mapred/local</value>
    <final>true</final>
  </property>

Given that the default value is ${hadoop.tmp.dir}/mapred/local, could the
expanded capacity we're looking for be achieved simply by defining
mapred.local.dir to span multiple disks? (Setting aside the issue of temp
files so big that they could still fill a whole disk.)

On Wed, Oct 5, 2011 at 1:32 AM, Meng Mao  wrote:

> Currently, we've got defined:
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/hadoop/hadoop-metadata/cache/</value>
>   </property>
>
> In our experiments with SOLR, the intermediate files are so large that they
> tend to blow out disk space and fail (and annoyingly leave behind their huge
> failed attempts). We've had issues with it in the past, but we're having
> real problems with SOLR if we can't comfortably get more space out of
> hadoop.tmp.dir somehow.
>
> 1) It seems we never set *mapred.system.dir* to anything special, so it's
> defaulting to ${hadoop.tmp.dir}/mapred/system.
> Is this a problem? The docs seem to recommend against it when
> hadoop.tmp.dir had ${user.name} in it, which ours doesn't.
>
> 1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
> system files." To me, that means there must be a single path for
> mapred.system.dir, which sort of forces hadoop.tmp.dir to be a single path.
> Otherwise, one might imagine that you could specify multiple paths to store
> hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
> interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
> there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?
>
> 2) IIRC, there's a -D switch for supplying config name/value pairs into
> individual jobs. Does such a switch exist? Googling for single letters is
> fruitless. If we had a path on our workers with more space (in our case,
> another hard disk), could we simply pass that path in as hadoop.tmp.dir for
> our SOLR jobs? Without incurring any consistency issues on future jobs that
> might use the SOLR output on HDFS?
>
>
>
>
>


ways to expand hadoop.tmp.dir capacity?

2011-10-04 Thread Meng Mao
Currently, we've got defined:
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/hadoop-metadata/cache/</value>
  </property>

In our experiments with SOLR, the intermediate files are so large that they
tend to blow out disk space and fail (and annoyingly leave behind their huge
failed attempts). We've had issues with it in the past, but we're having
real problems with SOLR if we can't comfortably get more space out of
hadoop.tmp.dir somehow.

1) It seems we never set *mapred.system.dir* to anything special, so it's
defaulting to ${hadoop.tmp.dir}/mapred/system.
Is this a problem? The docs seem to recommend against it when hadoop.tmp.dir
had ${user.name} in it, which ours doesn't.

1b) The doc says mapred.system.dir is "the in-HDFS path to shared MapReduce
system files." To me, that means there must be a single path for
mapred.system.dir, which sort of forces hadoop.tmp.dir to be a single path.
Otherwise, one might imagine that you could specify multiple paths to store
hadoop.tmp.dir, like you can for dfs.data.dir. Is this a correct
interpretation? -- hadoop.tmp.dir could live on multiple paths/disks if
there were more mapping/lookup between mapred.system.dir and hadoop.tmp.dir?

2) IIRC, there's a -D switch for supplying config name/value pairs into
individual jobs. Does such a switch exist? Googling for single letters is
fruitless. If we had a path on our workers with more space (in our case,
another hard disk), could we simply pass that path in as hadoop.tmp.dir for
our SOLR jobs? Without incurring any consistency issues on future jobs that
might use the SOLR output on HDFS?
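
To make question 2 concrete, the switch I think I'm remembering is the
generic -D option that ToolRunner/GenericOptionsParser handles when a driver
implements Tool; a sketch of what I mean (the driver class name and paths
below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Would be launched as something like (paths illustrative):
//   hadoop jar solr-job.jar SolrDriver -D hadoop.tmp.dir=/disk2/hadoop-tmp in out
public class SolrDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already contains any -D overrides
        System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
        // ... build and submit the SOLR job with this conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new SolrDriver(), args));
    }
}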


Error using hadoop distcp

2011-10-04 Thread praveenesh kumar
I am trying to use distcp to copy a file from one HDFS to another.

But while copying I am getting the following exception:

hadoop distcp hdfs://ub13:54310/user/hadoop/weblog
hdfs://ub16:54310/user/hadoop/weblog

11/10/05 10:41:01 INFO mapred.JobClient: Task Id :
attempt_201110031447_0005_m_07_0, Status : FAILED
java.net.UnknownHostException: unknown host: ub16
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:113)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:215)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at
org.apache.hadoop.mapred.FileOutputCommitter.setupJob(FileOutputCommitter.java:48)
at
org.apache.hadoop.mapred.OutputCommitter.setupJob(OutputCommitter.java:124)
at org.apache.hadoop.mapred.Task.runJobSetupTask(Task.java:835)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:296)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

It's saying it can't find ub16, but the entry is there in the /etc/hosts
files, and I am able to ssh to both machines. Do I need passwordless ssh
between these two NNs?
What can be the issue? Is there anything I am missing before using distcp?

Thanks,
Praveenesh


Re: How do I diagnose IO bounded errors using the framework counters?

2011-10-04 Thread W.P. McNeill
Here's an even more basic question. I tried to figure out what
FILE_BYTES_READ means by searching every file in the Hadoop 0.20.203.0
installation for the string FILE_BYTES_READ by running
  find . -type f | xargs grep FILE_BYTES_READ

I only found this string in source files in the vaidya contrib directory and
the tools/rumen directory. Nothing in the main source base.

Where in the source code are these counters created and updated?


Can we use Inheritance hierarchy to specify the outputvalue class for mapper which is also inputvalue class for the reducer ?

2011-10-04 Thread Anuja Kulkarni
Hi,

We have a class hierarchy for the output value of the mapper, which is also
the input value of the reducer: a parent (abstract class) and child1, child2, ...

The mapper class is declared with the parent class as its output value class;
the map function emits either child1 or child2 depending on the logic used
(and the reducer declares the parent class as its input value class).

But we are getting the error "java.io.IOException: Type mismatch in value from
map: expected parent class, received child class".

So is it possible to specify the parent class as the output/input value class
for the mapper as well as the reducer, since we want to follow an
object-oriented approach? What is the correct way of achieving this?
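
For concreteness, here is a minimal sketch of the setup described above; all
class and member names are illustrative, not our real code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative hierarchy (each class would normally live in its own file).
abstract class ParentValue implements Writable { }

class Child1 extends ParentValue {
    public void write(DataOutput out) throws IOException { /* serialize fields */ }
    public void readFields(DataInput in) throws IOException { /* deserialize fields */ }
}

class Child2 extends ParentValue {
    public void write(DataOutput out) throws IOException { /* serialize fields */ }
    public void readFields(DataInput in) throws IOException { /* deserialize fields */ }
}

public class PolymorphicMapper extends Mapper<LongWritable, Text, Text, ParentValue> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emits either Child1 or Child2 depending on the record.
        context.write(new Text("key"), new Child1());
    }
}

// Driver side: job.setMapOutputValueClass(ParentValue.class);
// Emitting a Child1/Child2 instance with that declaration is what produces the
// "Type mismatch in value from map" IOException quoted above.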


- Anuja

Re: setInt & getInt

2011-10-04 Thread Joey Echeverria
The Job class copies the Configuration that you pass in. You either
need to do your conf.setInt("number", 12345) before you create the Job
object, or you need to call job.getConfiguration().setInt("number",
12345).
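
Roughly, the relevant lines would look like this (a sketch against the new
mapreduce API, reusing the "number" key from the programs below):

// Option 1: set the value before constructing the Job, so the copy it makes
// already contains it.
Configuration conf = getConf();
conf.setInt("number", 12345);
Job job = new Job(conf, "Conf2Test");

// Option 2: set it on the Configuration the Job actually holds.
job.getConfiguration().setInt("number", 12345);

// In the task, read it back from the context. Note also that the new-API
// Mapper has no configure(JobConf) hook; to initialize a field per task,
// override setup(Context) instead, e.g.:
//
//   @Override
//   protected void setup(Context context) {
//       intConfConf = context.getConfiguration().getInt("number", -2);
//   }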

-Joey

On Tue, Oct 4, 2011 at 12:28 PM, Ratner, Alan S (IS)
 wrote:
> I have no problem with Hadoop.mapred using JobConf to setInt integers and 
> pass them to my map(s) for getInt as shown in the first program below.  
> However, when I use Hadoop.mapreduce using Configuration to setInt these 
> values are invisible to my map's getInt's.  Please tell me what I am doing 
> wrong.  Thanks.
>
> Both programs expect to see a file with a line or two of text in a directory 
> named testIn.
>
> Alan Ratner
>
>
> This program uses JobConf and setInt/getInt and works fine.  It outputs:
> number = 12345 (from map)
>
>
> package cbTest;
>
> import java.io.*;
> import org.apache.hadoop.conf.*;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.util.*;
>
> public class ConfTest extends Configured implements Tool {
>
>            @SuppressWarnings("deprecation")
>            public static class MapClass extends MapReduceBase implements 
> Mapper<LongWritable, Text, Text, Text> {
>                        public static int number;
>
>                        public void configure(JobConf job) {
>                                    number = job.getInt("number",-999);
>                        }
>
>                        public void map(LongWritable key, Text t, 
> OutputCollector<Text, Text> output,
>                                                Reporter reporter) throws 
> IOException {
>                                    System.out.println("number = " + number);
>                        }
>            }
>
>            @SuppressWarnings("deprecation")
>            public int run(String[] args) throws Exception {
>                        Path InputDirectory = new Path("testIn");
>                        Path OutputDirectory = new Path("testOut");
>                        System.out.println(" Running ConfTest Program");
>                        JobConf conf = new JobConf(getConf(), ConfTest.class);
>                        conf.setInputFormat(TextInputFormat.class);
>                        conf.setOutputKeyClass(Text.class);
>                        conf.setOutputValueClass(Text.class);
>                        conf.setMapperClass(MapClass.class);
>                        conf.setInt("number", 12345);
>                        FileInputFormat.setInputPaths(conf, InputDirectory);
>                        FileOutputFormat.setOutputPath(conf, OutputDirectory);
>                        FileSystem fs = OutputDirectory.getFileSystem(conf);
>                        fs.delete(OutputDirectory, true); //remove output of 
> prior run
>                        JobClient.runJob(conf);
>                        return 0;
>            }
>
>            public static void main(String[] args) throws Exception {
>                        int res = ToolRunner.run(new Configuration(), new 
> ConfTest(), args);
>                        System.exit(res);
>            }
> }
>
>
> This program uses Configuration and setInt/getInt, but getInt works in
> neither map nor configure.  It outputs:
>> Passing integer 12345 in configuration < (from run)
> map numbers: -999 -1 (from map as intMapConf and intConfConf)
>
>
> package cbTest;
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.conf.Configured;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
> import org.apache.hadoop.util.Tool;
> import org.apache.hadoop.util.ToolRunner;
>
> public class Conf2Test extends Configured implements Tool
> {
>            public static class MapClass extends Mapper<LongWritable, Text, Text, Text>
>    {
>                        public int intConfConf = -1;
>
>                        public void configure(Configuration job) {
>                                    intConfConf = job.getInt("number", -2);
>                        }
>
>        public void map(LongWritable key, Text value, Context context) throws 
> IOException, InterruptedException
>        {
>                        int intMapConf = 
> context.getConfiguration().getInt("number", -999);
>                        System.out.println("map numbers: " + intMapConf+" 
> "+intConfConf);
>        }
>    }
>
>    public static void main(String[] args) throws Exception
>    {
>        int res = ToolRunner.run(new Configuration(), new Conf2Test(), args);
>        System.exit(res);
>
>    }
>
>            public int run(String[] arg0) throws Exception {
>       

setInt & getInt

2011-10-04 Thread Ratner, Alan S (IS)
I have no problem with Hadoop.mapred using JobConf to setInt integers and pass 
them to my map(s) for getInt as shown in the first program below.  However, 
when I use Hadoop.mapreduce using Configuration to setInt these values are 
invisible to my map's getInt's.  Please tell me what I am doing wrong.  Thanks.

Both programs expect to see a file with a line or two of text in a directory 
named testIn.

Alan Ratner


This program uses JobConf and setInt/getInt and works fine.  It outputs:
number = 12345 (from map)


package cbTest;

import java.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ConfTest extends Configured implements Tool {

@SuppressWarnings("deprecation")
public static class MapClass extends MapReduceBase implements 
Mapper<LongWritable, Text, Text, Text> {
public static int number;

public void configure(JobConf job) {
number = job.getInt("number",-999);
}

public void map(LongWritable key, Text t, 
OutputCollector<Text, Text> output,
Reporter reporter) throws 
IOException {
System.out.println("number = " + number);
}
}

@SuppressWarnings("deprecation")
public int run(String[] args) throws Exception {
Path InputDirectory = new Path("testIn");
Path OutputDirectory = new Path("testOut");
System.out.println(" Running ConfTest Program");
JobConf conf = new JobConf(getConf(), ConfTest.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(MapClass.class);
conf.setInt("number", 12345);
FileInputFormat.setInputPaths(conf, InputDirectory);
FileOutputFormat.setOutputPath(conf, OutputDirectory);
FileSystem fs = OutputDirectory.getFileSystem(conf);
fs.delete(OutputDirectory, true); //remove output of 
prior run
JobClient.runJob(conf);
return 0;
}

public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new 
ConfTest(), args);
System.exit(res);
}
}


This program uses Configuration and setInt/getInt, but getInt works in neither
map nor configure.  It outputs:
> Passing integer 12345 in configuration < (from run)
map numbers: -999 -1 (from map as intMapConf and intConfConf)


package cbTest;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Conf2Test extends Configured implements Tool
{
public static class MapClass extends Mapper<LongWritable, Text, Text, Text>
{
public int intConfConf = -1;

public void configure(Configuration job) {
intConfConf = job.getInt("number", -2);
}

public void map(LongWritable key, Text value, Context context) throws 
IOException, InterruptedException
{
int intMapConf = 
context.getConfiguration().getInt("number", -999);
System.out.println("map numbers: " + intMapConf+" 
"+intConfConf);
}
}

public static void main(String[] args) throws Exception
{
int res = ToolRunner.run(new Configuration(), new Conf2Test(), args);
System.exit(res);

}

public int run(String[] arg0) throws Exception {
Path Input_Directory = new Path("testIn");
Path Output_Directory = new Path("testOut");

Configuration conf = new Configuration();
Job job = new Job(conf, 
Conf2Test.class.getSimpleName());
job.setJarByClass(Conf2Test.class);
job.setMapperClass(MapClass.class);
job.setMapOutputKeyClass(Text.class);
   

Problem with hadoop decommission

2011-10-04 Thread trang van anh

Dear all,

I set up a Hadoop cluster with the following structure:

1 Namenode named server1;
4 datanodes named server1, server2, server3, server4.

Hadoop cluster capacity: 128G
DFS used: 49G
I want to detach server1 from the cluster, but the decommission is taking a
very long time. I don't know why.


Any ideas for me?

Thanks in advance.