Inconsistent state in JobTracker (cdh)

2012-11-20 Thread Jan Lukavský

Hi all,

we are from time to time experiencing somewhat odd behavior of the JobTracker 
(using the CDH release, currently cdh3u3, but I suppose this affects at 
least all cdh3 releases so far). What we are seeing is an M/R job being 
stuck between the map and reduce phases, with 100% of maps completed, but the web 
UI reports 1 running map task. Since we 
have mapred.reduce.slowstart.completed.maps set to 1.0 (for 
better throughput of jobs), the reduce phase will never start and the job 
has to be killed. I have investigated this a bit and I think I have 
found the reason for it.


12/11/20 01:05:10 INFO mapred.JobInProgress: Task 
'attempt_201211011002_1852_m_007638_0' has completed 
task_201211011002_1852_m_007638 successfully.
12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease 
on some output path File does not exist. [Lease. Holder: 
DFSClient_408514838, pendingcreates: 1]


12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block 
blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. 
Source file some output path - Aborting...
12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job 
job_201211011002_1852removing PrintWriter from FileManager
12/11/20 01:05:10 ERROR security.UserGroupInformation: 
PriviledgedActionException as:mapred (auth:SIMPLE) 
cause:java.io.IOException: java.util.ConcurrentModificationException
12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call 
heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, 
false, true, -17988) from 10.2.73.35:44969: error: java.io.IOException: 
java.util.ConcurrentModificationException



When I look at the source code of JobInProgress.completedTask(), I see 
the log message about successful completion of the task, and after that, the 
logging to HDFS (JobHistory.Task.logFinished()). I suppose that if this 
call throws an exception (as in the case above), the call to 
completedTask() is aborted *before* the counters runningMapTasks and 
finishedMapTasks are updated accordingly. I created a heap dump of the 
JobTracker and indeed found the counter runningMapTasks set to 1 while 
finishedMapTasks was equal to numMapTasks - 1.


Now, the question is: should this be handled in the JobTracker (say, by 
moving the logging code after the counter manipulation)? Or should the 
TaskTracker re-report the completed task on an error in the JobTracker? What 
can cause the LeaseExpiredException? Should a JIRA be filed? :)
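
For illustration, a minimal self-contained sketch of that first option (hypothetical class and names, not the real JobInProgress code): update the counters before the HDFS history write, and shield them from its failure, so an exception there cannot leave the job bookkeeping half done.

import java.io.IOException;

class TaskBookkeeping {
    private int runningMapTasks = 1;
    private int finishedMapTasks = 0;

    // Stand-in for the relevant part of completedTask(): counters first,
    // then the history write, whose failure is caught instead of propagated.
    synchronized void completedTask(HistoryLogger logger) {
        runningMapTasks -= 1;
        finishedMapTasks += 1;
        try {
            logger.logFinished();   // stands in for JobHistory.Task.logFinished()
        } catch (IOException e) {
            System.err.println("History logging failed; counters already updated: " + e);
        }
    }

    interface HistoryLogger {
        void logFinished() throws IOException;
    }
}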


Thanks for comments,
 Jan




Start time, end time, and task tracker of individual tasks of a job

2012-11-20 Thread Jeff LI
Hello,

Is there a way to obtain the information of each individual task of a
map-reduce job, including start time, end time, which task tracker runs
this task, and so on?

I know this information can be found through the web interface running on
the jobtracker.  But is it possible to redirect the information to a nicely
formatted log file for each job?

By the way, I'm running hadoop 0.20.2-cdh3u5.

Thanks in advance for the help.

Cheers

Jeff


block size

2012-11-20 Thread Kartashov, Andy
Guys,

After changing the block size property from 64 to 128 MB, will I need to 
re-import data, or will running the hadoop balancer resize existing blocks in HDFS?

Thanks,
AK



RE: block size

2012-11-20 Thread Kartashov, Andy
Cheers!

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, November 20, 2012 11:34 AM
To: user@hadoop.apache.org
Subject: Re: block size

Hi,

Am 20.11.2012 um 17:31 schrieb Kartashov, Andy 
andy.kartas...@mpac.ca:


After changing property of block size from 64 to 128Mb, will I need to 
re-import data or will running hadoop balancer will resize blocks in hdfs?

The block size affects new files only; existing files will not be modified. As 
you said, you need to re-import those old files if you want to store them with 
the new block size.
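
For reference, a hedged sketch of the relevant CDH3 / Hadoop 1.x setting in hdfs-site.xml (134217728 bytes = 128 MB); it only applies to files written after the change, and existing files keep their old blocks until they are rewritten, for example by copying them out of and back into HDFS:

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>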

Kai

--
Kai Voigt
k...@123.org






Re: Start time, end time, and task tracker of individual tasks of a job

2012-11-20 Thread Harsh J
Hey Jeff,

Yes, we expose some information for each task completion event:

For Old API, use RunningJob, specifically:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getTaskCompletionEvents(int)

For New API, use Job, specifically:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getTaskCompletionEvents(int)
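
For example, a minimal old-API sketch for 0.20.x / CDH3 that prints one line per task attempt; the job id string is a placeholder for a real one:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class TaskEventDump {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName("job_201211011002_0001"));
        if (job == null) {
            System.err.println("Job not found on the JobTracker");
            return;
        }
        int from = 0;
        TaskCompletionEvent[] events;
        while ((events = job.getTaskCompletionEvents(from)).length > 0) {
            for (TaskCompletionEvent e : events) {
                // attempt id, final status, run time in ms, and the task tracker's HTTP address
                System.out.println(e.getTaskAttemptId() + "\t" + e.getTaskStatus()
                    + "\t" + e.getTaskRunTime() + " ms\t" + e.getTaskTrackerHttp());
            }
            from += events.length;
        }
    }
}

Note that the completion events carry run time and tracker location rather than absolute start/end timestamps; for those, the job history files are the better source (see Vinod's reply further down).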

On Tue, Nov 20, 2012 at 9:43 PM, Jeff LI uniquej...@gmail.com wrote:
 Hello,

 Is there a way to obtain the information of each individual task of a
 map-reduce job, including start time, end time, which task tracker runs this
 task and so on?

 I know this information can be found through the web interface running on
 the jobtracter.  But is it possible to redirect the information to a nicely
 formatted log file for each job?

 By the way, I'm running hadoop 0.20.2-cdh3u5.

 Thanks advance for the help.

 Cheers

 Jeff




-- 
Harsh J


Re: debugging hadoop streaming programs (first code)

2012-11-20 Thread Vinod Kumar Vavilapalli

The MapReduce web UI gives you all the information you need for debugging your 
code. Depending on where your JobTracker is, you should go hit 
$JT_HOST_NAME:50030 and check the job link as well as the task, task-attempt and 
logs pages.

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Nov 20, 2012, at 5:33 AM, jamal sasha wrote:

 Hi,
If I just use pipes, then the code runs just fine.. the issue is when I 
 deploy it on clusters...
 :(
 Any suggestions on how to debug it.
 
 
 On Tue, Nov 20, 2012 at 7:42 AM, Mahesh Balija balijamahesh@gmail.com 
 wrote:
 Hi Jamal,
 
   You can debug your MapReduce program if it is written in java code, 
 by running your MR job in LocalRunner mode via eclipse.
   Or even you can have some debug statements (or even S.O.Ps) written 
 in your code so that you can check where your job fails.
 
  But I am not sure about Python; one suggestion is to run 
 your Python code (map unit & reduce unit) locally on your input data and see 
 whether your logic has any issues.
 
 Best,
 Mahesh Balija,
 Calsoft Labs.
 
 
 On Tue, Nov 20, 2012 at 6:50 AM, jamal sasha jamalsha...@gmail.com wrote:
 
 
 
 Hi,
   This is my first attempt to learn the map reduce abstraction.
 
 My problem is as follows
 I have a text file as follows:
 id 1, id2, date,time,mrps,code,code2
 3710100022400,1350219887, 2011-09-10, 12:39:38.000, 99.00, 1, 0 
 3710100022400, 5045462785, 2011-09-06, 13:23:00.000, 70.63, 1, 0 
 
 Now what I want to do is to count the number of transactions happening in 
 every half hour between 7 am and 11 am.
 So here are the  intervals.
 
 7-7:30 -0
 7:30-8 - 1
 8-8:30-2
 
 10:30-11-7
 So ultimately what I am doing is creating a 2d dictionary 
 d[id2][interval] = count_transactions.
 
 My mappers and reducers are attached (sample input also).
 The code run just fine if i run via
 cat input.txt | python mapper.py | sort | python reducer.py
 
 It gives me the output, but when I run it on the cluster it throws an error which 
 is not helpful (basically the terminal just says job unsuccessful, reason NA).
 Any suggestion on what I am doing wrong?
 
 Jamal 
 
 
 
 
 
 
 





Re: Start time, end time, and task tracker of individual tasks of a job

2012-11-20 Thread Vinod Kumar Vavilapalli

Most of this information is already available in the JobHistory files. And 
there are parsers to read from these files.

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/
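
As a concrete starting point on 0.20.x (a hedged aside; the output directory is a placeholder), the history viewer built into the job CLI parses those files for you:

hadoop job -history /path/to/job/output
hadoop job -history all /path/to/job/output

The 'all' variant should include per-attempt detail such as start and finish times.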

On Nov 20, 2012, at 8:13 AM, Jeff LI wrote:

 Hello,
 
 Is there a way to obtain the information of each individual task of a 
 map-reduce job, including start time, end time, which task tracker runs this 
 task and so on? 
 
 I know this information can be found through the web interface running on the 
 jobtracter.  But is it possible to redirect the information to a nicely 
 formatted log file for each job? 
 
 By the way, I'm running hadoop 0.20.2-cdh3u5.
 
 Thanks advance for the help.
 
 Cheers
 
 Jeff
 





number of reducers

2012-11-20 Thread jamal sasha
Hi,

  I wrote a simple map reduce job in hadoop streaming.



I am wondering if I am doing something wrong ..

While the number of mappers is projected to be around 1700, the number of reducers is just 1?

It’s a couple of TBs’ worth of data.

What can I do to address this?

Basically the mapper looks like this:



for line in sys.stdin:
    print line


Reducer:

for line in sys.stdin:
    new_line = process_line(line)
    print new_line





Thanks


Re: number of reducers

2012-11-20 Thread jamal sasha
Awesome thanks . Works great now

On Tuesday, November 20, 2012, Bejoy KS bejoy.had...@gmail.com wrote:
 Hi Sasha

 By default the number of reducers is set to 1. If you want more, you
need to specify it, as in:

 hadoop jar myJar.jar myClass -D mapred.reduce.tasks=20 ...

 Regards
 Bejoy KS

 Sent from handheld, please excuse typos.
 
 From: jamal sasha jamalsha...@gmail.com
 Date: Tue, 20 Nov 2012 14:38:54 -0500
 To: user@hadoop.apache.org
 ReplyTo: user@hadoop.apache.org
 Subject: number of reducers


 Hi,

   I wrote a simple map reduce job in hadoop streaming.



 I am wondering if I am doing something wrong ..

 While number of mappers are projected to be around 1700.. reducers.. just
1?

 It’s couple of TB’s worth of data.

 What can I do to address this.

 Basically mapper looks like this



 For line in sys.stdin:

 Print line



 Reducer

 For line in sys.stdin:

 New_line = process_line(line)

 Print new_line





 Thanks





RE: number of reducers

2012-11-20 Thread Kartashov, Andy
I specify mine inside mapred-site.xml

<property>
  <name>mapred.reduce.tasks</name>
  <value>20</value>
</property>

Rgds,
AK47
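
For completeness, when the job is written in Java rather than streaming, the same setting can be made programmatically in the driver; a hedged sketch (the class name is only illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {
    public static JobConf configure() {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setNumReduceTasks(20);   // same effect as -D mapred.reduce.tasks=20 or the mapred-site.xml property
        return conf;
    }
}

The new API offers the equivalent Job.setNumReduceTasks(int).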
From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Tuesday, November 20, 2012 3:10 PM
To: user@hadoop.apache.org
Subject: Re: number of reducers

Hi Sasha

By default the number or reducers are set to be 1. If you want more you need to 
specify it as

hadoop jar myJar.jar myClass -D mapred.reduce.tasks=20 ...
Regards
Bejoy KS

Sent from handheld, please excuse typos.

From: jamal sasha jamalsha...@gmail.com
Date: Tue, 20 Nov 2012 14:38:54 -0500
To: user@hadoop.apache.org
ReplyTo: user@hadoop.apache.org
Subject: number of reducers



Hi,

  I wrote a simple map reduce job in hadoop streaming.



I am wondering if I am doing something wrong ..

While number of mappers are projected to be around 1700.. reducers.. just 1?

It's couple of TB's worth of data.

What can I do to address this.

Basically mapper looks like this



For line in sys.stdin:

Print line



Reducer

For line in sys.stdin:

New_line = process_line(line)

Print new_line





Thanks



Re: number of reducers

2012-11-20 Thread alxsss

 What is the relationship between the number of reducers and CPU cores in your 
setup? I read somewhere that it should be 0.5 times the number of CPU cores.

Thanks.
Alex.

 

 

-Original Message-
From: Kartashov, Andy andy.kartas...@mpac.ca
To: user user@hadoop.apache.org; bejoy.hadoop bejoy.had...@gmail.com
Sent: Tue, Nov 20, 2012 1:51 pm
Subject: RE: number of reducers



I specify mine inside mapred-site.xml
 
<property>
  <name>mapred.reduce.tasks</name>
  <value>20</value>
</property>
 
Rgds,
AK47

From: Bejoy KS [mailto:bejoy.had...@gmail.com]
Sent: Tuesday, November 20, 2012 3:10 PM
To: user@hadoop.apache.org
Subject: Re: number of reducers

 
Hi Sasha

By default the number or reducers are set to be 1. If you want more you need to 
specify it as

hadoop jar myJar.jar myClass -D mapred.reduce.tasks=20 ...

Regards
Bejoy KS

Sent from handheld, please excuse typos.



From: jamal sasha jamalsha...@gmail.com 

Date: Tue, 20 Nov 2012 14:38:54 -0500

To: user@hadoop.apache.org

ReplyTo: user@hadoop.apache.org 

Subject: number of reducers

 



Hi,

  I wrote a simple map reduce job in hadoop streaming.

 

I am wondering if I am doing something wrong ..

While number of mappers are projected to be around 1700.. reducers.. just 1?

It’s couple of TB’s worth of data.

What can I do to address this.

Basically mapper looks like this

 

For line in sys.stdin:

Print line

 

Reducer

For line in sys.stdin:

New_line = process_line(line)

Print new_line

 

 

Thanks


 


problem with upgrading from HDFS 0.21 to HDFS 1.0.4

2012-11-20 Thread rongshen.long
hi all,
It seems it is not supported to upgrade Hadoop from 0.21 to the stable version 
1.0.4. The 'linkBlocks' function in DataStorage.java (v1.0.4) cannot work 
correctly, because the datanode storage structure of the former is different 
from the latter: there are 'finalized' and 'rbw' directories under 
$dfs.datanode.data.dir/current.
Do you have any suggestions for dealing with this problem?

2012-11-20



rongshen.long

Re: number of reducers

2012-11-20 Thread Harsh J
Hey Jamal,

I'd recommend first going over the whole tutorial to get a good grip
on how Hadoop MR is designed to work:
http://hadoop.apache.org/docs/stable/mapred_tutorial.html

On Wed, Nov 21, 2012 at 1:08 AM, jamal sasha jamalsha...@gmail.com wrote:


 Hi,

   I wrote a simple map reduce job in hadoop streaming.



 I am wondering if I am doing something wrong ..

 While number of mappers are projected to be around 1700.. reducers.. just 1?

 It’s couple of TB’s worth of data.

 What can I do to address this.

 Basically mapper looks like this



 For line in sys.stdin:

 Print line



 Reducer

 For line in sys.stdin:

 New_line = process_line(line)

 Print new_line





 Thanks





-- 
Harsh J


Supplying a jar for a map-reduce job

2012-11-20 Thread Pankaj Gupta
Hi,

I am running map-reduce jobs on a Hadoop 0.23 cluster. Right now I supply the jar 
to use for the map-reduce job using the setJarByClass function on 
org.apache.hadoop.mapreduce.Job. This makes my code depend on a class in the MR 
job at compile time. What I want is to be able to run an MR job without being 
dependent on it at compile time. It would be great if I could use a jar that 
contains the Mapper and Reducer classes and just pass it in to run the map-reduce 
job. That would make it easy to choose an MR job to run at runtime. Is that 
possible?


Thanks in Advance,
Pankaj

Re: Supplying a jar for a map-reduce job

2012-11-20 Thread Bejoy KS
Hi Pankaj

AFAIK you can do the same. Just provide the properties like mapper class, 
reducer class, input format, output format etc. using the -D option at run time.



Regards
Bejoy KS

Sent from handheld, please excuse typos.
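
Along the same lines, a hedged old-API sketch of selecting the jar and classes at runtime (distinct from the -D route above): the jar path and class names are placeholders, and the jar must also be on the client classpath (e.g. via HADOOP_CLASSPATH) so the names can be resolved at submit time.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;

public class RuntimeJarDriver {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Ship a jar chosen at runtime instead of deriving it with setJarByClass.
        conf.setJar("/path/to/mr-job.jar");
        // Load the mapper/reducer by name: no compile-time dependency on the job's classes.
        conf.setMapperClass((Class<? extends Mapper>) conf.getClassByName("com.example.MyMapper"));
        conf.setReducerClass((Class<? extends Reducer>) conf.getClassByName("com.example.MyReducer"));
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}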

-Original Message-
From: Pankaj Gupta pan...@brightroll.com
Date: Tue, 20 Nov 2012 20:49:29 
To: user@hadoop.apache.orguser@hadoop.apache.org
Reply-To: user@hadoop.apache.org
Subject: Supplying a jar for a map-reduce job

Hi,

I am running map-reduce jobs on Hadoop 0.23 cluster. Right now I supply the jar 
to use for running the map-reduce job using the setJarByClass function on 
org.apache.hadoop.mapreduce.Job. This makes my code depend on a class in the MR 
job at compile. What I want is to be able to run an MR job without being 
dependent on it at compile time. It would be great if I could use a jar that 
contains the Mapper and Reducer classes and just pass it to run the map reduce 
job. That would make it easy to choose an MR job to run at runtime. Is that 
possible?


Thanks in Advance,
Pankaj

ISSUE while configuring ECLIPSE with MAP-REDUCE

2012-11-20 Thread yogesh dhari

Hi Hadoop Champs,

I am facing this issue while trying to configure Eclipse with Map-Reduce.

Exception in thread "main" java.lang.Error: Unresolved compilation problems: 
The method setInputFormat(Class<? extends InputFormat>) in the type JobConf 
is not applicable for the arguments (Class<TextInputFormat>)
The method setOutputFormat(Class<? extends OutputFormat>) in the type 
JobConf is not applicable for the arguments (Class<TextOutputFormat>)
The method setInputPaths(Job, String) in the type FileInputFormat is not 
applicable for the arguments (JobConf, Path)
The method setOutputPath(Job, Path) in the type FileOutputFormat is not 
applicable for the arguments (JobConf, Path)

at TestDriver.main(TestDriver.java:30)




I have these classes and flow pattern.


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;



public class TestDriver {

public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf = new JobConf(TestDriver.class);

// TODO: specify output types
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

// TODO: specify input and output DIRECTORIES (not files)
//conf.setInputPath(new Path("src"));
//conf.setOutputPath(new Path("out"));

conf.setInputFormat(TextInputFormat.class);  /* ERROR shown is :: The 
method setInputFormat(Class<? extends InputFormat>) in the type JobConf is not 
applicable for the arguments (Class<TextInputFormat>) */

conf.setOutputFormat(TextOutputFormat.class);   /* ERROR shown is :: 
The method setOutputFormat(Class<? extends OutputFormat>) in the type JobConf 
is not applicable for the arguments (Class<TextOutputFormat>) */

FileInputFormat.setInputPaths(conf, new Path("In"));  /* ERROR shown is 
:: The method setInputPaths(Job, String) in the type FileInputFormat is not 
applicable for the arguments (JobConf, Path) */


FileOutputFormat.setOutputPath(conf, new Path("Out"));  /* ERROR shown 
is :: The method setOutputPath(Job, Path) in the type FileOutputFormat is not 
applicable for the arguments (JobConf, Path) */


// TODO: specify a mapper
conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);

// TODO: specify a reducer

conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}

}


Please suggest & help

Thanks & Regards
Yogesh Kumar
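
The compiler errors above come from mixing the two MapReduce APIs: JobConf and its setInputFormat/setOutputFormat belong to the old org.apache.hadoop.mapred API, while the imported TextInputFormat, TextOutputFormat, FileInputFormat and FileOutputFormat come from the new org.apache.hadoop.mapreduce.lib packages. A hedged sketch of the old-API imports that match JobConf (with these, the calls in the driver compile as written):

import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;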

  

RE: ISSUE while configuring ECLIPSE with MAP-REDUCE

2012-11-20 Thread yogesh dhari

I am using Apache Hadoop-0.20.2 

Regards
Yogesh Kumar

From: yogeshdh...@live.com
To: user@hadoop.apache.org
Subject: ISSUE while configuring ECLIPSE with MAP-REDUCE
Date: Wed, 21 Nov 2012 11:17:42 +0530






Is there an additional overhead when storing data in HDFS?

2012-11-20 Thread WangRamon
Hi All,

I'm wondering if there is an additional overhead when storing data in HDFS. 
For example, I have a 2GB file and the replication factor of HDFS is 2; when 
the file is uploaded to HDFS, should HDFS use 4GB to store it, or more than 
4GB? If it takes more than 4GB of space, why?

Thanks
Ramon

Re: Is there an additional overhead when storing data in HDFS?

2012-11-20 Thread Suresh Srinivas
HDFS uses 4GB for the file + checksum data.

By default, for every 512 bytes of data, 4 bytes of checksum are stored. In
this case that is an additional 32MB of data.
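
For the arithmetic behind that figure: the checksum overhead is 4/512 = 1/128 of the stored bytes, so the 2GB file carries roughly 2048MB / 128 = 16MB of checksum per replica, or about 32MB across the two replicas.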

On Tue, Nov 20, 2012 at 11:00 PM, WangRamon ramon_w...@hotmail.com wrote:

 Hi All

 I'm wondering if there is an additional overhead when storing some data
 into HDFS? For example, I have a 2GB file, the replicate factor of HDSF is
 2, when the file is uploaded to HDFS, should HDFS use 4GB to store it or
 more then 4GB to store it? If it takes more than 4GB space, why?

 Thanks
 Ramon




-- 
http://hortonworks.com/download/


RE: Is there an additional overhead when storing data in HDFS?

2012-11-20 Thread WangRamon
Thanks. Besides the checksum data, is there anything else? Data in the name node?
 Date: Tue, 20 Nov 2012 23:14:06 -0800
Subject: Re: Is there an additional overhead when storing data in HDFS?
From: sur...@hortonworks.com
To: user@hadoop.apache.org

HDFS uses 4GB for the file + checksum data.
Default is for every 512 bytes of data, 4 bytes of checksum are stored. In this 
case additional 32MB data.

On Tue, Nov 20, 2012 at 11:00 PM, WangRamon ramon_w...@hotmail.com wrote:




Hi All
 
I'm wondering if there is an additional overhead when storing some data into 
HDFS? For example, I have a 2GB file, the replicate factor of HDSF is 2, when 
the file is uploaded to HDFS, should HDFS use 4GB to store it or more then 4GB 
to store it? If it takes more than 4GB space, why?

 
Thanks
Ramon 
  


-- 
 http://hortonworks.com/download/


  

Re: Is there an additional overhead when storing data in HDFS?

2012-11-20 Thread Mohammad Tariq
Hello Ramon,

 Why don't you go through this link once:
http://www.aosabook.org/en/hdfs.html
Suresh and the guys have explained everything beautifully.

HTH

Regards,
Mohammad Tariq



On Wed, Nov 21, 2012 at 12:58 PM, Suresh Srinivas sur...@hortonworks.com wrote:

 The Namenode will have a trivial amount of data stored in the journal/fsimage.


 On Tue, Nov 20, 2012 at 11:21 PM, WangRamon ramon_w...@hotmail.com wrote:

 Thanks, besides the checksum data is there anything else? Data in name
 node?

 --
 Date: Tue, 20 Nov 2012 23:14:06 -0800
 Subject: Re: Is there an additional overhead when storing data in HDFS?
 From: sur...@hortonworks.com
 To: user@hadoop.apache.org


 HDFS uses 4GB for the file + checksum data.

 Default is for every 512 bytes of data, 4 bytes of checksum are stored.
 In this case additional 32MB data.

 On Tue, Nov 20, 2012 at 11:00 PM, WangRamon ramon_w...@hotmail.com wrote:

 Hi All

 I'm wondering if there is an additional overhead when storing some data
 into HDFS? For example, I have a 2GB file, the replicate factor of HDSF is
 2, when the file is uploaded to HDFS, should HDFS use 4GB to store it or
 more then 4GB to store it? If it takes more than 4GB space, why?

 Thanks
 Ramon




 --
 http://hortonworks.com/download/




 --
 http://hortonworks.com/download/